From: Zi Yan <[email protected]>
Hi all,
This patchset boosts the hugepage migration throughput and helps THP migration
which is added by Naoya's patches: https://lwn.net/Articles/705879/.
Motivation
===============================
On x86, 4KB page migrations underutilize the memory bandwidth compared
to 2MB THP migrations. I did some page migration benchmarking on a two-socket
Intel Xeon E5-2640v3 box, which has 23.4GB/s memory bandwidth, and discovered
a big throughput gap, ~3x, between 4KB and 2MB page migrations.
Here are the throughput numbers for different page sizes and page numbers:
           | 512 4KB pages | 1 2MB THP | 1 4KB page
x86_64     | 0.98GB/s      | 2.97GB/s  | 0.06GB/s
As Linux currently uses single-threaded page migration, the throughput is still
much lower than the hardware bandwidth, 2.97GB/s vs 23.4GB/s. So I parallelized
the copy_page() part of THP migration with workqueues and achieved a 2.8x throughput gain.
Here are the throughput numbers of 2MB page migration:
            | single-threaded | 8-thread
x86_64 2MB  | 2.97GB/s        | 8.58GB/s
Here is the benchmark you can use to compare page migration time:
https://github.com/x-y-z/thp-migration-bench
As this patchset requires Naoya's patches, this repo has both patchsets applied:
https://github.com/x-y-z/linux-thp-migration/tree/page_migration_opt_upstream
Patchset description
===============================
This patchset adds a new migrate_mode, MIGRATE_MT, which selects a parallelized
page migration routine. Only copy_huge_page() is parallelized. MIGRATE_MT is
enabled either by a sysctl knob, vm.accel_page_copy, or by an additional flag,
MPOL_MF_MOVE_MT, passed to the move_pages() system call.
The parallelized copy routine splits a single huge page across 4 workqueue
threads and waits until they finish.
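For illustration, here is a minimal userspace sketch of how the new move_pages()
flag could be used once this patchset is applied. The buffer setup is simplified,
and MPOL_MF_MOVE_MT is defined locally (its value taken from this series) because
libnuma headers do not carry it; compile with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_MF_MOVE_MT
#define MPOL_MF_MOVE_MT (1 << 6)	/* new flag added by this series */
#endif

int main(void)
{
	size_t len = 2UL << 20;			/* one 2MB huge page */
	void *buf = aligned_alloc(len, len);
	void *pages[1];
	int nodes[1] = { 1 };			/* destination node */
	int status[1];

	if (!buf)
		return 1;
	madvise(buf, len, MADV_HUGEPAGE);	/* ask for a THP backing */
	memset(buf, 0, len);			/* fault the page in */

	pages[0] = buf;
	/* request the multi-threaded copy for this migration */
	if (move_pages(0 /* self */, 1, pages, nodes, status,
		       MPOL_MF_MOVE | MPOL_MF_MOVE_MT))
		perror("move_pages");
	else
		printf("page is now on node %d\n", status[0]);
	return 0;
}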
Discussion
===============================
1. For testing purposes, I chose to use a sysctl knob to enable and disable the
parallel huge page migration. I need comments on how to enable and disable it,
or whether to just enable it for all huge page migrations.
2. The hard-coded count of 4 workqueue threads is not adaptive; any suggestions?
For example, a boot-time benchmark to find an appropriate number?
3. The parallel huge page migration works best with threads placed on
different physical cores, not all on the same hyper-threaded core. Is there
any easy way to find out the core topology? (A sketch of one possibility follows.)
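As a starting point for item 3, here is a hedged sketch (not part of the posted
patches) that uses the existing cpumask_of_node() and topology_sibling_cpumask()
helpers to pick at most one CPU per physical core on the destination node; the
helper name is made up for illustration:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/topology.h>

/* Pick at most one CPU per physical core on @node, skipping HT siblings. */
static int pick_one_cpu_per_core(int node, int *cpu_list, int max_cpus)
{
	const struct cpumask *node_mask = cpumask_of_node(node);
	cpumask_var_t seen;
	int cpu, n = 0;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return -ENOMEM;

	for_each_cpu(cpu, node_mask) {
		/* skip CPUs whose core already contributed a thread */
		if (cpumask_test_cpu(cpu, seen))
			continue;
		cpumask_or(seen, seen, topology_sibling_cpumask(cpu));
		cpu_list[n++] = cpu;
		if (n >= max_cpus)
			break;
	}
	free_cpumask_var(seen);
	return n;
}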
Any comments are welcome. Thanks.
--
Best Regards,
Zi Yan
Zi Yan (5):
mm: migrate: Add mode parameter to support additional page copy
routines.
mm: migrate: Change migrate_mode to support combination migration
modes.
migrate: Add copy_page_mt to use multi-threaded page migration.
mm: migrate: Add copy_page_mt into migrate_pages.
mm: migrate: Add vm.accel_page_copy in sysctl to control whether to use
multi-threaded copy to accelerate page copy.
fs/aio.c | 2 +-
fs/hugetlbfs/inode.c | 2 +-
fs/ubifs/file.c | 2 +-
include/linux/highmem.h | 2 +
include/linux/migrate.h | 6 ++-
include/linux/migrate_mode.h | 7 +--
include/uapi/linux/mempolicy.h | 2 +
kernel/sysctl.c | 12 ++++++
mm/Makefile | 2 +
mm/compaction.c | 20 ++++-----
mm/copy_page.c | 96 ++++++++++++++++++++++++++++++++++++++++++
mm/migrate.c | 61 ++++++++++++++++++---------
12 files changed, 175 insertions(+), 39 deletions(-)
create mode 100644 mm/copy_page.c
--
2.10.2
From: Zi Yan <[email protected]>
From: Zi Yan <[email protected]>
migrate_page_copy() and copy_huge_page() are affected.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
fs/aio.c | 2 +-
fs/hugetlbfs/inode.c | 2 +-
fs/ubifs/file.c | 2 +-
include/linux/migrate.h | 6 ++++--
mm/migrate.c | 14 ++++++++------
5 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 428484f..a67c764 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
* events from being lost.
*/
spin_lock_irqsave(&ctx->completion_lock, flags);
- migrate_page_copy(new, old);
+ migrate_page_copy(new, old, 0);
BUG_ON(ctx->ring_pages[idx] != old);
ctx->ring_pages[idx] = new;
spin_unlock_irqrestore(&ctx->completion_lock, flags);
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4fb7b10..a17bfef 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -850,7 +850,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
- migrate_page_copy(newpage, page);
+ migrate_page_copy(newpage, page, 0);
return MIGRATEPAGE_SUCCESS;
}
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index b4fbeef..bf54e32 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1468,7 +1468,7 @@ static int ubifs_migrate_page(struct address_space *mapping,
SetPagePrivate(newpage);
}
- migrate_page_copy(newpage, page);
+ migrate_page_copy(newpage, page, 0);
return MIGRATEPAGE_SUCCESS;
}
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..c78593d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -42,7 +42,8 @@ extern void putback_movable_page(struct page *page);
extern int migrate_prep(void);
extern int migrate_prep_local(void);
-extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern void migrate_page_copy(struct page *newpage, struct page *page,
+ enum migrate_mode mode);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
extern int migrate_page_move_mapping(struct address_space *mapping,
@@ -61,7 +62,8 @@ static inline int migrate_prep(void) { return -ENOSYS; }
static inline int migrate_prep_local(void) { return -ENOSYS; }
static inline void migrate_page_copy(struct page *newpage,
- struct page *page) {}
+ struct page *page,
+ enum migrate_mode mode) {}
static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5bd202c..bc6c1c4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -629,7 +629,8 @@ static void __copy_gigantic_page(struct page *dst, struct page *src,
}
}
-static void copy_huge_page(struct page *dst, struct page *src)
+static void copy_huge_page(struct page *dst, struct page *src,
+ enum migrate_mode mode)
{
int i;
int nr_pages;
@@ -658,12 +659,13 @@ static void copy_huge_page(struct page *dst, struct page *src)
/*
* Copy the page to its new location
*/
-void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page,
+ enum migrate_mode mode)
{
int cpupid;
if (PageHuge(page) || PageTransHuge(page))
- copy_huge_page(newpage, page);
+ copy_huge_page(newpage, page, mode);
else
copy_highpage(newpage, page);
@@ -745,7 +747,7 @@ int migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
- migrate_page_copy(newpage, page);
+ migrate_page_copy(newpage, page, mode);
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);
@@ -795,7 +797,7 @@ int buffer_migrate_page(struct address_space *mapping,
SetPagePrivate(newpage);
- migrate_page_copy(newpage, page);
+ migrate_page_copy(newpage, page, 0);
bh = head;
do {
@@ -2020,7 +2022,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
new_page->index = page->index;
- migrate_page_copy(new_page, page);
+ migrate_page_copy(new_page, page, 0);
WARN_ON(PageLRU(new_page));
/* Recheck the target PMD */
--
2.10.2
From: Zi Yan <[email protected]>
From: Zi Yan <[email protected]>
Since base page migration did not gain any speedup from
multi-threaded methods, we only accelerate the huge page case.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
kernel/sysctl.c | 11 +++++++++++
mm/migrate.c | 6 ++++++
2 files changed, 17 insertions(+)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d54ce12..6c79444 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -98,6 +98,8 @@
#if defined(CONFIG_SYSCTL)
+extern int accel_page_copy;
+
/* External variables not in a header file. */
extern int suid_dumpable;
#ifdef CONFIG_COREDUMP
@@ -1361,6 +1363,15 @@ static struct ctl_table vm_table[] = {
.proc_handler = &hugetlb_mempolicy_sysctl_handler,
},
#endif
+ {
+ .procname = "accel_page_copy",
+ .data = &accel_page_copy,
+ .maxlen = sizeof(accel_page_copy),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
{
.procname = "hugetlb_shm_group",
.data = &sysctl_hugetlb_shm_group,
diff --git a/mm/migrate.c b/mm/migrate.c
index 244ece6..e64b490 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,8 @@
#include "internal.h"
+int accel_page_copy = 1;
+
/*
* migrate_prep() needs to be called before we start compiling a list of pages
* to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -651,6 +653,10 @@ static void copy_huge_page(struct page *dst, struct page *src,
nr_pages = hpage_nr_pages(src);
}
+ /* Try to accelerate page migration if it is not specified in mode */
+ if (accel_page_copy)
+ mode |= MIGRATE_MT;
+
if (mode & MIGRATE_MT)
rc = copy_page_mt(dst, src, nr_pages);
--
2.10.2
From: Zi Yan <[email protected]>
From: Zi Yan <[email protected]>
No functionality is changed.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
include/linux/migrate_mode.h | 6 +++---
mm/compaction.c | 20 ++++++++++----------
mm/migrate.c | 14 +++++++-------
3 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..0e2deb8 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -8,9 +8,9 @@
* MIGRATE_SYNC will block when migrating pages
*/
enum migrate_mode {
- MIGRATE_ASYNC,
- MIGRATE_SYNC_LIGHT,
- MIGRATE_SYNC,
+ MIGRATE_ASYNC = 1<<0,
+ MIGRATE_SYNC_LIGHT = 1<<1,
+ MIGRATE_SYNC = 1<<2,
};
#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/compaction.c b/mm/compaction.c
index 0409a4a..6606ded 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -296,7 +296,7 @@ static void update_pageblock_skip(struct compact_control *cc,
if (migrate_scanner) {
if (pfn > zone->compact_cached_migrate_pfn[0])
zone->compact_cached_migrate_pfn[0] = pfn;
- if (cc->mode != MIGRATE_ASYNC &&
+ if (!(cc->mode & MIGRATE_ASYNC) &&
pfn > zone->compact_cached_migrate_pfn[1])
zone->compact_cached_migrate_pfn[1] = pfn;
} else {
@@ -329,7 +329,7 @@ static void update_pageblock_skip(struct compact_control *cc,
static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
struct compact_control *cc)
{
- if (cc->mode == MIGRATE_ASYNC) {
+ if (cc->mode & MIGRATE_ASYNC) {
if (!spin_trylock_irqsave(lock, *flags)) {
cc->contended = true;
return false;
@@ -370,7 +370,7 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
}
if (need_resched()) {
- if (cc->mode == MIGRATE_ASYNC) {
+ if (cc->mode & MIGRATE_ASYNC) {
cc->contended = true;
return true;
}
@@ -393,7 +393,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
{
/* async compaction aborts if contended */
if (need_resched()) {
- if (cc->mode == MIGRATE_ASYNC) {
+ if (cc->mode & MIGRATE_ASYNC) {
cc->contended = true;
return true;
}
@@ -704,7 +704,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
while (unlikely(too_many_isolated(zone))) {
/* async migration should just abort */
- if (cc->mode == MIGRATE_ASYNC)
+ if (cc->mode & MIGRATE_ASYNC)
return 0;
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -716,7 +716,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (compact_should_abort(cc))
return 0;
- if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
+ if (cc->direct_compaction && (cc->mode & MIGRATE_ASYNC)) {
skip_on_failure = true;
next_skip_pfn = block_end_pfn(low_pfn, cc->order);
}
@@ -1204,7 +1204,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
- (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+ (!(cc->mode & MIGRATE_SYNC) ? ISOLATE_ASYNC_MIGRATE : 0);
/*
* Start at where we last stopped, or beginning of the zone as
@@ -1250,7 +1250,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
* Async compaction is optimistic to see if the minimum amount
* of work satisfies the allocation.
*/
- if (cc->mode == MIGRATE_ASYNC &&
+ if ((cc->mode & MIGRATE_ASYNC) &&
!migrate_async_suitable(get_pageblock_migratetype(page)))
continue;
@@ -1493,7 +1493,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
- const bool sync = cc->mode != MIGRATE_ASYNC;
+ const bool sync = !(cc->mode & MIGRATE_ASYNC);
ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
@@ -1589,7 +1589,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
* order-aligned block, so skip the rest of it.
*/
if (cc->direct_compaction &&
- (cc->mode == MIGRATE_ASYNC)) {
+ (cc->mode & MIGRATE_ASYNC)) {
cc->migrate_pfn = block_end_pfn(
cc->migrate_pfn - 1, cc->order);
/* Draining pcplists is useless in this case */
diff --git a/mm/migrate.c b/mm/migrate.c
index bc6c1c4..4a4cf48 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -394,7 +394,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (!(mode & MIGRATE_ASYNC)) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -495,7 +495,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode & MIGRATE_ASYNC) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_ref_unfreeze(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -779,7 +779,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (!(mode & MIGRATE_ASYNC))
BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
@@ -861,7 +861,7 @@ static int fallback_migrate_page(struct address_space *mapping,
{
if (PageDirty(page)) {
/* Only writeback pages in full synchronous migration */
- if (mode != MIGRATE_SYNC)
+ if (!(mode & MIGRATE_SYNC))
return -EBUSY;
return writeout(mapping, page);
}
@@ -970,7 +970,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
bool is_lru = !__PageMovable(page);
if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || (mode & MIGRATE_ASYNC))
goto out;
/*
@@ -999,7 +999,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* the retry loop is too short and in the sync-light case,
* the overhead of stalling is too much
*/
- if (mode != MIGRATE_SYNC) {
+ if (!(mode & MIGRATE_SYNC)) {
rc = -EBUSY;
goto out_unlock;
}
@@ -1262,7 +1262,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
return -ENOMEM;
if (!trylock_page(hpage)) {
- if (!force || mode != MIGRATE_SYNC)
+ if (!force || !(mode & MIGRATE_SYNC))
goto out;
lock_page(hpage);
}
--
2.10.2
From: Zi Yan <[email protected]>
From: Zi Yan <[email protected]>
Internally, copy_page_mt splits a page into multiple threads
and send them as jobs to system_highpri_wq.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
include/linux/highmem.h | 2 ++
kernel/sysctl.c | 1 +
mm/Makefile | 2 ++
mm/copy_page.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 101 insertions(+)
create mode 100644 mm/copy_page.c
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f329..519e575 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
#endif
+int copy_page_mt(struct page *to, struct page *from, int nr_pages);
+
static inline void copy_highpage(struct page *to, struct page *from)
{
char *vfrom, *vto;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 706309f..d54ce12 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -97,6 +97,7 @@
#if defined(CONFIG_SYSCTL)
+
/* External variables not in a header file. */
extern int suid_dumpable;
#ifdef CONFIG_COREDUMP
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a..467305b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \
obj-y += init-mm.o
+obj-y += copy_page.o
+
ifdef CONFIG_NO_BOOTMEM
obj-y += nobootmem.o
else
diff --git a/mm/copy_page.c b/mm/copy_page.c
new file mode 100644
index 0000000..ca7ce6c
--- /dev/null
+++ b/mm/copy_page.c
@@ -0,0 +1,96 @@
+/*
+ * Parallel page copy routine.
+ *
+ * Zi Yan <[email protected]>
+ *
+ */
+
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/freezer.h>
+
+
+const unsigned int limit_mt_num = 4;
+
+/* ======================== multi-threaded copy page ======================== */
+
+struct copy_page_info {
+ struct work_struct copy_page_work;
+ char *to;
+ char *from;
+ unsigned long chunk_size;
+};
+
+static void copy_page_routine(char *vto, char *vfrom,
+ unsigned long chunk_size)
+{
+ memcpy(vto, vfrom, chunk_size);
+}
+
+static void copy_page_work_queue_thread(struct work_struct *work)
+{
+ struct copy_page_info *my_work = (struct copy_page_info *)work;
+
+ copy_page_routine(my_work->to,
+ my_work->from,
+ my_work->chunk_size);
+}
+
+int copy_page_mt(struct page *to, struct page *from, int nr_pages)
+{
+ unsigned int total_mt_num = limit_mt_num;
+ int to_node = page_to_nid(to);
+ int i;
+ struct copy_page_info *work_items;
+ char *vto, *vfrom;
+ unsigned long chunk_size;
+ const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+ int cpu_id_list[32] = {0};
+ int cpu;
+
+ total_mt_num = min_t(unsigned int, total_mt_num,
+ cpumask_weight(per_node_cpumask));
+ total_mt_num = (total_mt_num / 2) * 2;
+
+ work_items = kcalloc(total_mt_num, sizeof(struct copy_page_info),
+ GFP_KERNEL);
+ if (!work_items)
+ return -ENOMEM;
+
+ i = 0;
+ for_each_cpu(cpu, per_node_cpumask) {
+ if (i >= total_mt_num)
+ break;
+ cpu_id_list[i] = cpu;
+ ++i;
+ }
+
+ vfrom = kmap(from);
+ vto = kmap(to);
+ chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
+
+ for (i = 0; i < total_mt_num; ++i) {
+ INIT_WORK((struct work_struct *)&work_items[i],
+ copy_page_work_queue_thread);
+
+ work_items[i].to = vto + i * chunk_size;
+ work_items[i].from = vfrom + i * chunk_size;
+ work_items[i].chunk_size = chunk_size;
+
+ queue_work_on(cpu_id_list[i],
+ system_highpri_wq,
+ (struct work_struct *)&work_items[i]);
+ }
+
+ /* Wait until it finishes */
+ for (i = 0; i < total_mt_num; ++i)
+ flush_work((struct work_struct *)&work_items[i]);
+
+ kunmap(to);
+ kunmap(from);
+
+ kfree(work_items);
+
+ return 0;
+}
--
2.10.2
From: Zi Yan <[email protected]>
From: Zi Yan <[email protected]>
An option is added to move_pages() syscall to use multi-threaded
page migration.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
include/linux/migrate_mode.h | 1 +
include/uapi/linux/mempolicy.h | 2 ++
mm/migrate.c | 27 +++++++++++++++++++--------
3 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 0e2deb8..c711e2a 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -11,6 +11,7 @@ enum migrate_mode {
MIGRATE_ASYNC = 1<<0,
MIGRATE_SYNC_LIGHT = 1<<1,
MIGRATE_SYNC = 1<<2,
+ MIGRATE_MT = 1<<3,
};
#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 9cd8b21..5d42dc6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -54,6 +54,8 @@ enum mpol_rebind_step {
#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
+#define MPOL_MF_MOVE_MT (1<<6) /* Use multi-threaded page copy routine */
+
#define MPOL_MF_VALID (MPOL_MF_STRICT | \
MPOL_MF_MOVE | \
MPOL_MF_MOVE_ALL)
diff --git a/mm/migrate.c b/mm/migrate.c
index 4a4cf48..244ece6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -634,6 +634,7 @@ static void copy_huge_page(struct page *dst, struct page *src,
{
int i;
int nr_pages;
+ int rc = -EFAULT;
if (PageHuge(src)) {
/* hugetlbfs page */
@@ -650,10 +651,14 @@ static void copy_huge_page(struct page *dst, struct page *src,
nr_pages = hpage_nr_pages(src);
}
- for (i = 0; i < nr_pages; i++) {
- cond_resched();
- copy_highpage(dst + i, src + i);
- }
+ if (mode & MIGRATE_MT)
+ rc = copy_page_mt(dst, src, nr_pages);
+
+ if (rc)
+ for (i = 0; i < nr_pages; i++) {
+ cond_resched();
+ copy_highpage(dst + i, src + i);
+ }
}
/*
@@ -1461,11 +1466,16 @@ static struct page *new_page_node(struct page *p, unsigned long private,
*/
static int do_move_page_to_node_array(struct mm_struct *mm,
struct page_to_node *pm,
- int migrate_all)
+ int migrate_all,
+ int migrate_use_mt)
{
int err;
struct page_to_node *pp;
LIST_HEAD(pagelist);
+ enum migrate_mode mode = MIGRATE_SYNC;
+
+ if (migrate_use_mt)
+ mode |= MIGRATE_MT;
down_read(&mm->mmap_sem);
@@ -1542,7 +1552,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = 0;
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_page_node, NULL,
- (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
+ (unsigned long)pm, mode, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
}
@@ -1619,7 +1629,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
/* Migrate this chunk */
err = do_move_page_to_node_array(mm, pm,
- flags & MPOL_MF_MOVE_ALL);
+ flags & MPOL_MF_MOVE_ALL,
+ flags & MPOL_MF_MOVE_MT);
if (err < 0)
goto out_pm;
@@ -1726,7 +1737,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
nodemask_t task_nodes;
/* Check flags */
- if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+ if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT))
return -EINVAL;
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
--
2.10.2
Hi Zi,
[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc6 next-20161122]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Zi-Yan/Parallel-hugepage-migration-optimization/20161123-022913
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
include/linux/compiler.h:253:8: sparse: attribute 'no_sanitize_address': unknown attribute
>> fs/f2fs/data.c:1938:26: sparse: not enough arguments for function migrate_page_copy
fs/f2fs/data.c: In function 'f2fs_migrate_page':
fs/f2fs/data.c:1938:2: error: too few arguments to function 'migrate_page_copy'
migrate_page_copy(newpage, page);
^~~~~~~~~~~~~~~~~
In file included from fs/f2fs/data.c:1893:0:
include/linux/migrate.h:45:13: note: declared here
extern void migrate_page_copy(struct page *newpage, struct page *page,
^~~~~~~~~~~~~~~~~
vim +1938 fs/f2fs/data.c
5b7a487c Weichao Guo 2016-09-20 1922 if (atomic_written) {
5b7a487c Weichao Guo 2016-09-20 1923 struct inmem_pages *cur;
5b7a487c Weichao Guo 2016-09-20 1924 list_for_each_entry(cur, &fi->inmem_pages, list)
5b7a487c Weichao Guo 2016-09-20 1925 if (cur->page == page) {
5b7a487c Weichao Guo 2016-09-20 1926 cur->page = newpage;
5b7a487c Weichao Guo 2016-09-20 1927 break;
5b7a487c Weichao Guo 2016-09-20 1928 }
5b7a487c Weichao Guo 2016-09-20 1929 mutex_unlock(&fi->inmem_lock);
5b7a487c Weichao Guo 2016-09-20 1930 put_page(page);
5b7a487c Weichao Guo 2016-09-20 1931 get_page(newpage);
5b7a487c Weichao Guo 2016-09-20 1932 }
5b7a487c Weichao Guo 2016-09-20 1933
5b7a487c Weichao Guo 2016-09-20 1934 if (PagePrivate(page))
5b7a487c Weichao Guo 2016-09-20 1935 SetPagePrivate(newpage);
5b7a487c Weichao Guo 2016-09-20 1936 set_page_private(newpage, page_private(page));
5b7a487c Weichao Guo 2016-09-20 1937
5b7a487c Weichao Guo 2016-09-20 @1938 migrate_page_copy(newpage, page);
5b7a487c Weichao Guo 2016-09-20 1939
5b7a487c Weichao Guo 2016-09-20 1940 return MIGRATEPAGE_SUCCESS;
5b7a487c Weichao Guo 2016-09-20 1941 }
5b7a487c Weichao Guo 2016-09-20 1942 #endif
5b7a487c Weichao Guo 2016-09-20 1943
eb47b800 Jaegeuk Kim 2012-11-02 1944 const struct address_space_operations f2fs_dblock_aops = {
eb47b800 Jaegeuk Kim 2012-11-02 1945 .readpage = f2fs_read_data_page,
eb47b800 Jaegeuk Kim 2012-11-02 1946 .readpages = f2fs_read_data_pages,
:::::: The code at line 1938 was first introduced by commit
:::::: 5b7a487cf32d3a266fea83d590d3226b5ad817a7 f2fs: add customized migrate_page callback
:::::: TO: Weichao Guo <[email protected]>
:::::: CC: Jaegeuk Kim <[email protected]>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 11/23/2016 01:26 AM, kbuild test robot wrote:
> Hi Zi,
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.9-rc6 next-20161122]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Zi-Yan/Parallel-hugepage-migration-optimization/20161123-022913
> reproduce:
> # apt-get install sparse
> make ARCH=x86_64 allmodconfig
> make C=1 CF=-D__CHECK_ENDIAN__
>
>
> sparse warnings: (new ones prefixed by >>)
>
> include/linux/compiler.h:253:8: sparse: attribute 'no_sanitize_address': unknown attribute
> >> fs/f2fs/data.c:1938:26: sparse: not enough arguments for function migrate_page_copy
> fs/f2fs/data.c: In function 'f2fs_migrate_page':
> fs/f2fs/data.c:1938:2: error: too few arguments to function 'migrate_page_copy'
> migrate_page_copy(newpage, page);
Yeah, this got missed and needs to be fixed.
On 11/22/2016 09:55 PM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
There are multiple "From:" lines for this patch; this should be fixed to
keep just one of them.
>
> migrate_page_copy() and copy_huge_page() are affected.
In this patch you are just expanding the arguments of both of these
functions to include the "migration mode", which will then be implemented
by subsequent patches. Or should these patches be merged? I guess it's okay.
Please reword the commit message to include more details.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
Just a single Signed-off-by.
On 11/22/2016 09:55 PM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
>
> No functionality is changed.
The commit message needs to contain more details, e.g. that it changes
the enum declaration from plain values to bit positions, and where it
changes existing code like compaction and migration.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
Like the last patch, please fix the author details and sign-offs.
> ---
> include/linux/migrate_mode.h | 6 +++---
> mm/compaction.c | 20 ++++++++++----------
> mm/migrate.c | 14 +++++++-------
> 3 files changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index ebf3d89..0e2deb8 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -8,9 +8,9 @@
> * MIGRATE_SYNC will block when migrating pages
> */
> enum migrate_mode {
> - MIGRATE_ASYNC,
> - MIGRATE_SYNC_LIGHT,
> - MIGRATE_SYNC,
> + MIGRATE_ASYNC = 1<<0,
> + MIGRATE_SYNC_LIGHT = 1<<1,
> + MIGRATE_SYNC = 1<<2,
Right, so that they can be ORed with each other.
> };
>
> #endif /* MIGRATE_MODE_H_INCLUDED */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 0409a4a..6606ded 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -296,7 +296,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> if (migrate_scanner) {
> if (pfn > zone->compact_cached_migrate_pfn[0])
> zone->compact_cached_migrate_pfn[0] = pfn;
> - if (cc->mode != MIGRATE_ASYNC &&
> + if (!(cc->mode & MIGRATE_ASYNC) &&
> pfn > zone->compact_cached_migrate_pfn[1])
> zone->compact_cached_migrate_pfn[1] = pfn;
> } else {
> @@ -329,7 +329,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
> struct compact_control *cc)
> {
> - if (cc->mode == MIGRATE_ASYNC) {
> + if (cc->mode & MIGRATE_ASYNC) {
> if (!spin_trylock_irqsave(lock, *flags)) {
> cc->contended = true;
> return false;
> @@ -370,7 +370,7 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
> }
>
> if (need_resched()) {
> - if (cc->mode == MIGRATE_ASYNC) {
> + if (cc->mode & MIGRATE_ASYNC) {
> cc->contended = true;
> return true;
> }
> @@ -393,7 +393,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> {
> /* async compaction aborts if contended */
> if (need_resched()) {
> - if (cc->mode == MIGRATE_ASYNC) {
> + if (cc->mode & MIGRATE_ASYNC) {
> cc->contended = true;
> return true;
> }
> @@ -704,7 +704,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> */
> while (unlikely(too_many_isolated(zone))) {
> /* async migration should just abort */
> - if (cc->mode == MIGRATE_ASYNC)
> + if (cc->mode & MIGRATE_ASYNC)
> return 0;
>
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -716,7 +716,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> if (compact_should_abort(cc))
> return 0;
>
> - if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
> + if (cc->direct_compaction && (cc->mode & MIGRATE_ASYNC)) {
> skip_on_failure = true;
> next_skip_pfn = block_end_pfn(low_pfn, cc->order);
> }
> @@ -1204,7 +1204,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct page *page;
> const isolate_mode_t isolate_mode =
> (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
> - (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> + (!(cc->mode & MIGRATE_SYNC) ? ISOLATE_ASYNC_MIGRATE : 0);
>
> /*
> * Start at where we last stopped, or beginning of the zone as
> @@ -1250,7 +1250,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> * Async compaction is optimistic to see if the minimum amount
> * of work satisfies the allocation.
> */
> - if (cc->mode == MIGRATE_ASYNC &&
> + if ((cc->mode & MIGRATE_ASYNC) &&
> !migrate_async_suitable(get_pageblock_migratetype(page)))
> continue;
>
> @@ -1493,7 +1493,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
> unsigned long start_pfn = zone->zone_start_pfn;
> unsigned long end_pfn = zone_end_pfn(zone);
> const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
> - const bool sync = cc->mode != MIGRATE_ASYNC;
> + const bool sync = !(cc->mode & MIGRATE_ASYNC);
>
> ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
> cc->classzone_idx);
> @@ -1589,7 +1589,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
> * order-aligned block, so skip the rest of it.
> */
> if (cc->direct_compaction &&
> - (cc->mode == MIGRATE_ASYNC)) {
> + (cc->mode & MIGRATE_ASYNC)) {
> cc->migrate_pfn = block_end_pfn(
> cc->migrate_pfn - 1, cc->order);
> /* Draining pcplists is useless in this case */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index bc6c1c4..4a4cf48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -394,7 +394,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> struct buffer_head *bh = head;
>
> /* Simple case, sync compaction */
> - if (mode != MIGRATE_ASYNC) {
> + if (!(mode & MIGRATE_ASYNC)) {
> do {
> get_bh(bh);
> lock_buffer(bh);
> @@ -495,7 +495,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
> * the mapping back due to an elevated page count, we would have to
> * block waiting on other references to be dropped.
> */
> - if (mode == MIGRATE_ASYNC && head &&
> + if ((mode & MIGRATE_ASYNC) && head &&
> !buffer_migrate_lock_buffers(head, mode)) {
> page_ref_unfreeze(page, expected_count);
> spin_unlock_irq(&mapping->tree_lock);
> @@ -779,7 +779,7 @@ int buffer_migrate_page(struct address_space *mapping,
> * with an IRQ-safe spinlock held. In the sync case, the buffers
> * need to be locked now
> */
> - if (mode != MIGRATE_ASYNC)
> + if (!(mode & MIGRATE_ASYNC))
> BUG_ON(!buffer_migrate_lock_buffers(head, mode));
>
> ClearPagePrivate(page);
> @@ -861,7 +861,7 @@ static int fallback_migrate_page(struct address_space *mapping,
> {
> if (PageDirty(page)) {
> /* Only writeback pages in full synchronous migration */
> - if (mode != MIGRATE_SYNC)
> + if (!(mode & MIGRATE_SYNC))
> return -EBUSY;
> return writeout(mapping, page);
> }
> @@ -970,7 +970,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> bool is_lru = !__PageMovable(page);
>
> if (!trylock_page(page)) {
> - if (!force || mode == MIGRATE_ASYNC)
> + if (!force || (mode & MIGRATE_ASYNC))
> goto out;
>
> /*
> @@ -999,7 +999,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> * the retry loop is too short and in the sync-light case,
> * the overhead of stalling is too much
> */
> - if (mode != MIGRATE_SYNC) {
> + if (!(mode & MIGRATE_SYNC)) {
> rc = -EBUSY;
> goto out_unlock;
> }
> @@ -1262,7 +1262,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> return -ENOMEM;
>
> if (!trylock_page(hpage)) {
> - if (!force || mode != MIGRATE_SYNC)
> + if (!force || !(mode & MIGRATE_SYNC))
> goto out;
> lock_page(hpage);
> }
So here are the conversions
(mode == MIGRATE_SYNC) ---> (mode & MIGRATE_SYNC)
(mode != MIGRATE_SYNC) ---> !(mode & MIGRATE_SYNC)
It should be okay.
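As a stand-alone illustration (not from the patch itself), here is the
difference between the two tests once a caller combines modes; the enum values
mirror the patched definitions, with MIGRATE_MT coming from a later patch in
this series:

#include <stdio.h>

enum migrate_mode {
	MIGRATE_ASYNC      = 1 << 0,
	MIGRATE_SYNC_LIGHT = 1 << 1,
	MIGRATE_SYNC       = 1 << 2,
	MIGRATE_MT         = 1 << 3,	/* added later in this series */
};

int main(void)
{
	enum migrate_mode mode = MIGRATE_SYNC | MIGRATE_MT;

	/* old equality test no longer matches a combined mode */
	printf("mode == MIGRATE_SYNC -> %d\n", mode == MIGRATE_SYNC);	/* 0 */
	/* bitwise test still detects the sync component */
	printf("mode &  MIGRATE_SYNC -> %d\n", !!(mode & MIGRATE_SYNC));	/* 1 */
	return 0;
}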
On 11/22/2016 09:55 PM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
Please fix these.
>
> Internally, copy_page_mt splits a page into multiple threads
> and send them as jobs to system_highpri_wq.
The function should be renamed to copy_page_multithread() or at
the least copy_page_mthread() to make more sense. The commit
message needs to be more comprehensive and detailed.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> include/linux/highmem.h | 2 ++
> kernel/sysctl.c | 1 +
> mm/Makefile | 2 ++
> mm/copy_page.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 101 insertions(+)
> create mode 100644 mm/copy_page.c
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f329..519e575 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>
> #endif
>
> +int copy_page_mt(struct page *to, struct page *from, int nr_pages);
> +
> static inline void copy_highpage(struct page *to, struct page *from)
> {
> char *vfrom, *vto;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 706309f..d54ce12 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -97,6 +97,7 @@
>
> #if defined(CONFIG_SYSCTL)
>
> +
I guess this is a stray code change.
> /* External variables not in a header file. */
> extern int suid_dumpable;
> #ifdef CONFIG_COREDUMP
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a..467305b 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \
>
> obj-y += init-mm.o
>
> +obj-y += copy_page.o
It's getting compiled all the time. Don't you want to make it part
of a new config option which will cover all of this multi-threaded
copy code?
> +
> ifdef CONFIG_NO_BOOTMEM
> obj-y += nobootmem.o
> else
> diff --git a/mm/copy_page.c b/mm/copy_page.c
> new file mode 100644
> index 0000000..ca7ce6c
> --- /dev/null
> +++ b/mm/copy_page.c
> @@ -0,0 +1,96 @@
> +/*
> + * Parallel page copy routine.
> + *
> + * Zi Yan <[email protected]>
> + *
> + */
No, this is too little. Please see other files inside the mm directory
as examples.
> +
> +#include <linux/highmem.h>
> +#include <linux/workqueue.h>
> +#include <linux/slab.h>
> +#include <linux/freezer.h>
> +
> +
> +const unsigned int limit_mt_num = 4;
Where did this number 4 come from? At the very least it should be
configurable from either a sysctl variable or a sysfs file, so that
the user has control over the number of threads used for the copy. But
going forward this should be derived from an arch-specific callback
which analyzes the NUMA topology and scheduler load to figure out
how many threads should be used for optimum page copy performance.
> +
> +/* ======================== multi-threaded copy page ======================== */
> +
Please use the standard exported-function description format while
describing this new function. I think it's a good function to be
exported as a symbol as well.
> +struct copy_page_info {
s/copy_page_info/mthread_copy_struct/
> + struct work_struct copy_page_work;
> + char *to;
> + char *from;
Swap the order of 'to' and 'from'.
> + unsigned long chunk_size;
Just 'size' should be fine.
> +};
> +
> +static void copy_page_routine(char *vto, char *vfrom,
s/copy_page_routine/mthread_copy_fn/
> + unsigned long chunk_size)
> +{
> + memcpy(vto, vfrom, chunk_size);
> +}
s/chunk_size/size/
> +
> +static void copy_page_work_queue_thread(struct work_struct *work)
> +{
> + struct copy_page_info *my_work = (struct copy_page_info *)work;
> +
> + copy_page_routine(my_work->to,
> + my_work->from,
> + my_work->chunk_size);
> +}
> +
> +int copy_page_mt(struct page *to, struct page *from, int nr_pages)
> +{
> + unsigned int total_mt_num = limit_mt_num;
> + int to_node = page_to_nid(to);
Should we make sure that the entire page range [to, to + nr_pages] is
part of to_node?
> + int i;
> + struct copy_page_info *work_items;
> + char *vto, *vfrom;
> + unsigned long chunk_size;
> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
So all the threads used for the copy have to be part of the cpumask of the
destination node? Why? The copy accesses both the source pages as
well as the destination pages. Source-node threads might also perform
well for the memory accesses. Which and how many threads should be
used for the copy should be decided wisely via an architecture
callback. On a NUMA system this will have an impact on the performance
of the multi-threaded copy.
> + int cpu_id_list[32] = {0};
> + int cpu;
> +
> + total_mt_num = min_t(unsigned int, total_mt_num,
> + cpumask_weight(per_node_cpumask));
> + total_mt_num = (total_mt_num / 2) * 2;
> +
> + work_items = kcalloc(total_mt_num, sizeof(struct copy_page_info),
> + GFP_KERNEL);
> + if (!work_items)
> + return -ENOMEM;
> +
> + i = 0;
> + for_each_cpu(cpu, per_node_cpumask) {
> + if (i >= total_mt_num)
> + break;
> + cpu_id_list[i] = cpu;
> + ++i;
> + }
> +
> + vfrom = kmap(from);
> + vto = kmap(to);
> + chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
Coding style? Please run all these patches through the
scripts/checkpatch.pl script to catch coding style problems.
> +
> + for (i = 0; i < total_mt_num; ++i) {
> + INIT_WORK((struct work_struct *)&work_items[i],
> + copy_page_work_queue_thread);
> +
> + work_items[i].to = vto + i * chunk_size;
> + work_items[i].from = vfrom + i * chunk_size;
> + work_items[i].chunk_size = chunk_size;
> +
> + queue_work_on(cpu_id_list[i],
> + system_highpri_wq,
> + (struct work_struct *)&work_items[i]);
I am not very familiar with the system workqueues, but does
system_highpri_wq have the highest priority? Because if the time
spent waiting for these workqueue items to execute increases, it
can offset all the benefit we get from this multi-threaded copy.
On 11/22/2016 09:55 PM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
>
> An option is added to move_pages() syscall to use multi-threaded
> page migration.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> include/linux/migrate_mode.h | 1 +
> include/uapi/linux/mempolicy.h | 2 ++
> mm/migrate.c | 27 +++++++++++++++++++--------
> 3 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index 0e2deb8..c711e2a 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -11,6 +11,7 @@ enum migrate_mode {
> MIGRATE_ASYNC = 1<<0,
> MIGRATE_SYNC_LIGHT = 1<<1,
> MIGRATE_SYNC = 1<<2,
> + MIGRATE_MT = 1<<3,
MIGRATE_MTHREAD should be better.
> };
>
> #endif /* MIGRATE_MODE_H_INCLUDED */
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 9cd8b21..5d42dc6 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -54,6 +54,8 @@ enum mpol_rebind_step {
> #define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
> #define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
>
> +#define MPOL_MF_MOVE_MT (1<<6) /* Use multi-threaded page copy routine */
> +
s/MPOL_MF_MOVE_MT/MPOL_MF_MOVE_MTHREAD/
> #define MPOL_MF_VALID (MPOL_MF_STRICT | \
> MPOL_MF_MOVE | \
> MPOL_MF_MOVE_ALL)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4a4cf48..244ece6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -634,6 +634,7 @@ static void copy_huge_page(struct page *dst, struct page *src,
> {
> int i;
> int nr_pages;
> + int rc = -EFAULT;
>
> if (PageHuge(src)) {
> /* hugetlbfs page */
> @@ -650,10 +651,14 @@ static void copy_huge_page(struct page *dst, struct page *src,
> nr_pages = hpage_nr_pages(src);
> }
>
> - for (i = 0; i < nr_pages; i++) {
> - cond_resched();
> - copy_highpage(dst + i, src + i);
> - }
> + if (mode & MIGRATE_MT)
> + rc = copy_page_mt(dst, src, nr_pages);
> +
> + if (rc)
> + for (i = 0; i < nr_pages; i++) {
> + cond_resched();
> + copy_highpage(dst + i, src + i);
> + }
> }
So this covers both the case where MIGRATE_MT is requested and the
fallback when the multi-threaded copy fails. A small comment above the
code block would be good.
>
> /*
> @@ -1461,11 +1466,16 @@ static struct page *new_page_node(struct page *p, unsigned long private,
> */
> static int do_move_page_to_node_array(struct mm_struct *mm,
> struct page_to_node *pm,
> - int migrate_all)
> + int migrate_all,
> + int migrate_use_mt)
> {
> int err;
> struct page_to_node *pp;
> LIST_HEAD(pagelist);
> + enum migrate_mode mode = MIGRATE_SYNC;
> +
> + if (migrate_use_mt)
> + mode |= MIGRATE_MT;
>
> down_read(&mm->mmap_sem);
>
> @@ -1542,7 +1552,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> err = 0;
> if (!list_empty(&pagelist)) {
> err = migrate_pages(&pagelist, new_page_node, NULL,
> - (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
> + (unsigned long)pm, mode, MR_SYSCALL);
> if (err)
> putback_movable_pages(&pagelist);
> }
> @@ -1619,7 +1629,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>
> /* Migrate this chunk */
> err = do_move_page_to_node_array(mm, pm,
> - flags & MPOL_MF_MOVE_ALL);
> + flags & MPOL_MF_MOVE_ALL,
> + flags & MPOL_MF_MOVE_MT);
> if (err < 0)
> goto out_pm;
>
> @@ -1726,7 +1737,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
> nodemask_t task_nodes;
>
> /* Check flags */
> - if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> + if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT))
> return -EINVAL;
Wondering if do_move_page_to_node_array() is the only place which
needs to be changed to accommodate this.
On 11/22/2016 09:55 PM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
>
> Since base page migration did not gain any speedup from
> multi-threaded methods, we only accelerate the huge page case.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> kernel/sysctl.c | 11 +++++++++++
> mm/migrate.c | 6 ++++++
> 2 files changed, 17 insertions(+)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index d54ce12..6c79444 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -98,6 +98,8 @@
> #if defined(CONFIG_SYSCTL)
>
>
> +extern int accel_page_copy;
Hmm, accel_mthread_copy because this is achieved by a multi threaded
copy mechanism.
> +
> /* External variables not in a header file. */
> extern int suid_dumpable;
> #ifdef CONFIG_COREDUMP
> @@ -1361,6 +1363,15 @@ static struct ctl_table vm_table[] = {
> .proc_handler = &hugetlb_mempolicy_sysctl_handler,
> },
> #endif
> + {
> + .procname = "accel_page_copy",
> + .data = &accel_page_copy,
> + .maxlen = sizeof(accel_page_copy),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> {
> .procname = "hugetlb_shm_group",
> .data = &sysctl_hugetlb_shm_group,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 244ece6..e64b490 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -48,6 +48,8 @@
>
> #include "internal.h"
>
> +int accel_page_copy = 1;
> +
So it's enabled by default.
> /*
> * migrate_prep() needs to be called before we start compiling a list of pages
> * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> @@ -651,6 +653,10 @@ static void copy_huge_page(struct page *dst, struct page *src,
> nr_pages = hpage_nr_pages(src);
> }
>
> + /* Try to accelerate page migration if it is not specified in mode */
> + if (accel_page_copy)
> + mode |= MIGRATE_MT;
So even if none of the system calls requested a multi-threaded copy,
this setting will override everything and make it multi-threaded.
On 23/11/16 03:25, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
>
> migrate_page_copy() and copy_huge_page() are affected.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> fs/aio.c | 2 +-
> fs/hugetlbfs/inode.c | 2 +-
> fs/ubifs/file.c | 2 +-
> include/linux/migrate.h | 6 ++++--
> mm/migrate.c | 14 ++++++++------
> 5 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 428484f..a67c764 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
> * events from being lost.
> */
> spin_lock_irqsave(&ctx->completion_lock, flags);
> - migrate_page_copy(new, old);
> + migrate_page_copy(new, old, 0);
Can we have a useful enum instead of 0? It's harder to read and
understand a bare 0.
> BUG_ON(ctx->ring_pages[idx] != old);
> ctx->ring_pages[idx] = new;
> spin_unlock_irqrestore(&ctx->completion_lock, flags);
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 4fb7b10..a17bfef 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -850,7 +850,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
> rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> if (rc != MIGRATEPAGE_SUCCESS)
> return rc;
> - migrate_page_copy(newpage, page);
> + migrate_page_copy(newpage, page, 0);
Ditto
>
> return MIGRATEPAGE_SUCCESS;
> }
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index b4fbeef..bf54e32 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1468,7 +1468,7 @@ static int ubifs_migrate_page(struct address_space *mapping,
> SetPagePrivate(newpage);
> }
>
> - migrate_page_copy(newpage, page);
> + migrate_page_copy(newpage, page, 0);
Here as well
Balbir Singh.
On 23/11/16 03:25, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset boosts the hugepage migration throughput and helps THP migration
> which is added by Naoya's patches: https://lwn.net/Articles/705879/.
>
> Motivation
> ===============================
>
> On x86, 4KB page migrations underutilize the memory bandwidth compared
> to 2MB THP migrations. I did some page migration benchmarking on a two-socket
> Intel Xeon E5-2640v3 box, which has 23.4GB/s memory bandwidth, and discovered
> a big throughput gap, ~3x, between 4KB and 2MB page migrations.
>
> Here are the throughput numbers for different page sizes and page numbers:
>            | 512 4KB pages | 1 2MB THP | 1 4KB page
> x86_64     | 0.98GB/s      | 2.97GB/s  | 0.06GB/s
>
> As Linux currently uses single-threaded page migration, the throughput is still
> much lower than the hardware bandwidth, 2.97GB/s vs 23.4GB/s. So I parallelized
> the copy_page() part of THP migration with workqueues and achieved a 2.8x throughput gain.
>
> Here are the throughput numbers of 2MB page migration:
>             | single-threaded | 8-thread
> x86_64 2MB  | 2.97GB/s        | 8.58GB/s
>
What's the impact on CPU utilization? Is there a huge impact?
Balbir Singh.
On 23/11/16 03:25, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> From: Zi Yan <[email protected]>
>
> No functionality is changed.
I think you'd want to say that the modes are no longer
exclusive. We can use them as flags in combination?
Balbir Singh.
On 24 Nov 2016, at 5:09, Anshuman Khandual wrote:
> On 11/22/2016 09:55 PM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> From: Zi Yan <[email protected]>
>>
>> Since base page migration did not gain any speedup from
>> multi-threaded methods, we only accelerate the huge page case.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> kernel/sysctl.c | 11 +++++++++++
>> mm/migrate.c | 6 ++++++
>> 2 files changed, 17 insertions(+)
>>
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index d54ce12..6c79444 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -98,6 +98,8 @@
>> #if defined(CONFIG_SYSCTL)
>>
>>
>> +extern int accel_page_copy;
>
> Hmm, accel_mthread_copy because this is achieved by a multi threaded
> copy mechanism.
>
>> +
>> /* External variables not in a header file. */
>> extern int suid_dumpable;
>> #ifdef CONFIG_COREDUMP
>> @@ -1361,6 +1363,15 @@ static struct ctl_table vm_table[] = {
>> .proc_handler = &hugetlb_mempolicy_sysctl_handler,
>> },
>> #endif
>> + {
>> + .procname = "accel_page_copy",
>> + .data = &accel_page_copy,
>> + .maxlen = sizeof(accel_page_copy),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec,
>> + .extra1 = &zero,
>> + .extra2 = &one,
>> + },
>> {
>> .procname = "hugetlb_shm_group",
>> .data = &sysctl_hugetlb_shm_group,
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 244ece6..e64b490 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -48,6 +48,8 @@
>>
>> #include "internal.h"
>>
>> +int accel_page_copy = 1;
>> +
>
> So it's enabled by default.
>
>> /*
>> * migrate_prep() needs to be called before we start compiling a list of pages
>> * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
>> @@ -651,6 +653,10 @@ static void copy_huge_page(struct page *dst, struct page *src,
>> nr_pages = hpage_nr_pages(src);
>> }
>>
>> + /* Try to accelerate page migration if it is not specified in mode */
>> + if (accel_page_copy)
>> + mode |= MIGRATE_MT;
>
> So even if none of the system calls requested a multi-threaded copy,
> this setting will override everything and make it multi-threaded.
This only accelerates huge page copies and achieves much higher
throughput and lower copy time, so it should be useful most of the time.
As you suggested in the other email, I will make this optimization a config option.
If people enable it, they expect it to work. The sysctl interface just lets
them disable this optimization when they think it is not going to help.
--
Best Regards
Yan Zi
On 24 Nov 2016, at 18:56, Balbir Singh wrote:
> On 23/11/16 03:25, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> From: Zi Yan <[email protected]>
>>
>> migrate_page_copy() and copy_huge_page() are affected.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> fs/aio.c | 2 +-
>> fs/hugetlbfs/inode.c | 2 +-
>> fs/ubifs/file.c | 2 +-
>> include/linux/migrate.h | 6 ++++--
>> mm/migrate.c | 14 ++++++++------
>> 5 files changed, 15 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 428484f..a67c764 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
>> * events from being lost.
>> */
>> spin_lock_irqsave(&ctx->completion_lock, flags);
>> - migrate_page_copy(new, old);
>> + migrate_page_copy(new, old, 0);
>
> Can we have a useful enum instead of 0? It's harder to read and
> understand a bare 0.
How about MIGRATE_SINGLETHREAD = 0 ?
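Something like this sketch of the suggestion (not in the posted patches):

enum migrate_mode {
	MIGRATE_SINGLETHREAD = 0,	/* plain single-threaded copy */
	MIGRATE_ASYNC        = 1 << 0,
	MIGRATE_SYNC_LIGHT   = 1 << 1,
	MIGRATE_SYNC         = 1 << 2,
	MIGRATE_MT           = 1 << 3,
};

/* callers then read naturally: */
/* migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD); */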
>> BUG_ON(ctx->ring_pages[idx] != old);
>> ctx->ring_pages[idx] = new;
>> spin_unlock_irqrestore(&ctx->completion_lock, flags);
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 4fb7b10..a17bfef 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -850,7 +850,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
>> rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>> if (rc != MIGRATEPAGE_SUCCESS)
>> return rc;
>> - migrate_page_copy(newpage, page);
>> + migrate_page_copy(newpage, page, 0);
>
> Ditto
>
>>
>> return MIGRATEPAGE_SUCCESS;
>> }
>> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
>> index b4fbeef..bf54e32 100644
>> --- a/fs/ubifs/file.c
>> +++ b/fs/ubifs/file.c
>> @@ -1468,7 +1468,7 @@ static int ubifs_migrate_page(struct address_space *mapping,
>> SetPagePrivate(newpage);
>> }
>>
>> - migrate_page_copy(newpage, page);
>> + migrate_page_copy(newpage, page, 0);
>
> Here as well
>
>
> Balbir Singh.
--
Best Regards
Yan Zi
On 24 Nov 2016, at 18:59, Balbir Singh wrote:
> On 23/11/16 03:25, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> Hi all,
>>
>> This patchset boosts the hugepage migration throughput and helps THP migration
>> which is added by Naoya's patches: https://lwn.net/Articles/705879/.
>>
>> Motivation
>> ===============================
>>
>> On x86, 4KB page migrations underutilize the memory bandwidth compared
>> to 2MB THP migrations. I did some page migration benchmarking on a two-socket
>> Intel Xeon E5-2640v3 box, which has 23.4GB/s memory bandwidth, and discovered
>> a big throughput gap, ~3x, between 4KB and 2MB page migrations.
>>
>> Here are the throughput numbers for different page sizes and page numbers:
>>            | 512 4KB pages | 1 2MB THP | 1 4KB page
>> x86_64     | 0.98GB/s      | 2.97GB/s  | 0.06GB/s
>>
>> As Linux currently uses single-threaded page migration, the throughput is still
>> much lower than the hardware bandwidth, 2.97GB/s vs 23.4GB/s. So I parallelized
>> the copy_page() part of THP migration with workqueues and achieved a 2.8x throughput gain.
>>
>> Here are the throughput numbers of 2MB page migration:
>>             | single-threaded | 8-thread
>> x86_64 2MB  | 2.97GB/s        | 8.58GB/s
>>
>
> What's the impact on CPU utilization? Is there a huge impact?
>
> Balbir Singh.
It depends on the throughput we can achieve.
For the single-threaded copy (the current routine), it takes one CPU 2MB/(2.97GB/s) = 657.6 us
to copy one 2MB page.
For the 8-thread copy, it takes 8 CPUs 2MB/(8.58GB/s) = 227.6 us to copy one 2MB page.
If we have 8 idle CPUs, I think it is worth using them.
I am going to add code to check idle_cpu() in the system before doing the copy.
If no idle CPUs are present, I can fall back to single-threaded copy.
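A rough sketch of that fallback (the helper name and its placement are my
assumptions, not part of the posted patches):

#include <linux/cpumask.h>
#include <linux/sched.h>	/* idle_cpu() */

/* Return true when at least @want CPUs on @node look idle. */
static bool enough_idle_cpus(int node, unsigned int want)
{
	const struct cpumask *mask = cpumask_of_node(node);
	unsigned int idle = 0;
	int cpu;

	for_each_cpu(cpu, mask)
		if (idle_cpu(cpu) && ++idle >= want)
			return true;
	return false;
}

/* in copy_huge_page():
 *	if ((mode & MIGRATE_MT) &&
 *	    enough_idle_cpus(page_to_nid(dst), limit_mt_num))
 *		rc = copy_page_mt(dst, src, nr_pages);
 */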
--
Best Regards
Yan Zi
On 24 Nov 2016, at 4:26, Anshuman Khandual wrote:
> On 11/22/2016 09:55 PM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> From: Zi Yan <[email protected]>
>
> Please fix these.
>
>>
>> Internally, copy_page_mt splits a page into multiple threads
>> and send them as jobs to system_highpri_wq.
>
> The function should be renamed to copy_page_multithread() or at
> the least copy_page_mthread() to make more sense. The commit
> message needs to be more comprehensive and detailed.
>
Sure.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> include/linux/highmem.h | 2 ++
>> kernel/sysctl.c | 1 +
>> mm/Makefile | 2 ++
>> mm/copy_page.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 101 insertions(+)
>> create mode 100644 mm/copy_page.c
>>
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index bb3f329..519e575 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>>
>> #endif
>>
>> +int copy_page_mt(struct page *to, struct page *from, int nr_pages);
>> +
>> static inline void copy_highpage(struct page *to, struct page *from)
>> {
>> char *vfrom, *vto;
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 706309f..d54ce12 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -97,6 +97,7 @@
>>
>> #if defined(CONFIG_SYSCTL)
>>
>> +
>
> I guess this is a stray code change.
>
>> /* External variables not in a header file. */
>> extern int suid_dumpable;
>> #ifdef CONFIG_COREDUMP
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 295bd7a..467305b 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \
>>
>> obj-y += init-mm.o
>>
>> +obj-y += copy_page.o
>
> Its getting compiled all the time. Dont you want to make it part of
> of a new config option which will cover for all these code for multi
> thread copy ?
I can do that.
>> +
>> ifdef CONFIG_NO_BOOTMEM
>> obj-y += nobootmem.o
>> else
>> diff --git a/mm/copy_page.c b/mm/copy_page.c
>> new file mode 100644
>> index 0000000..ca7ce6c
>> --- /dev/null
>> +++ b/mm/copy_page.c
>> @@ -0,0 +1,96 @@
>> +/*
>> + * Parallel page copy routine.
>> + *
>> + * Zi Yan <[email protected]>
>> + *
>> + */
>
> No, this is too less. Please see other files inside mm directory as
> example.
>
Sure, I will add more description here.
>> +
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/freezer.h>
>> +
>> +
>> +const unsigned int limit_mt_num = 4;
>
> From where this number 4 came from ? At the very least it should be
> configured from either a sysctl variable or from a sysfs file, so
> that user will have control on number of threads used for copy. But
> going forward this should be derived out a arch specific call back
> which then analyzes NUMA topology and scheduler loads to figure out
> on how many threads should be used for optimum performance of page
> copy.
I will expose this to sysctl.
For finding optimum performance, can we do a boot time microbenchmark
to find the thread number?
For scheduler loads, can we traverse all online CPUs and use idle CPUs
by checking idle_cpu()?
>> +
>> +/* ======================== multi-threaded copy page ======================== */
>> +
>
> Please use standard exported function description semantics while
> describing this new function. I think its a good function to be
> exported as a symbol as well.
>
>> +struct copy_page_info {
>
> s/copy_page_info/mthread_copy_struct/
>
>> + struct work_struct copy_page_work;
>> + char *to;
>> + char *from;
>
> Swap the order of 'to' and 'from'.
>
>> + unsigned long chunk_size;
>
> Just 'size' should be fine.
>
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>
> s/copy_page_routine/mthread_copy_fn/
>
>> + unsigned long chunk_size)
>> +{
>> + memcpy(vto, vfrom, chunk_size);
>> +}
>
> s/chunk_size/size/
>
>
Will do the suggested changes in the next version.
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> + struct copy_page_info *my_work = (struct copy_page_info *)work;
>> +
>> + copy_page_routine(my_work->to,
>> + my_work->from,
>> + my_work->chunk_size);
>> +}
>> +
>> +int copy_page_mt(struct page *to, struct page *from, int nr_pages)
>> +{
>> + unsigned int total_mt_num = limit_mt_num;
>> + int to_node = page_to_nid(to);
>
> Should we make sure that the entire page range [to, to + nr_pages] is
> part of to_node.
>
Currently, this is only used for huge pages, and nr_pages = hpage_nr_pages().
This guarantees that the entire page range is in the same node.
>> + int i;
>> + struct copy_page_info *work_items;
>> + char *vto, *vfrom;
>> + unsigned long chunk_size;
>> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> So all the threads used for copy has to be part of cpumask of the
> destination node ? Why ? The copy accesses both the source pages as
> well as destination pages. Source node threads might also perform
> good for the memory accesses. Which and how many threads should be
> used for copy should be decided wisely from an architecture call
> back. On a NUMA system this will have impact on performance of the
> multi threaded copy.
This is based on my copy throughput benchmark results. The results
show that moving data from the remote node to the local node (pulling)
has higher throughput than moving data from the local node to the remote node (pushing).
I got the same results on both Intel Xeon and IBM Power8, but
it might not be the case for other machines.
Ideally, we could do a boot-time benchmark to find the best configuration,
i.e. whether to pull or push the data and how many threads to use. But for this
patchset, I will probably choose pulling the data.
>
>
>> + int cpu_id_list[32] = {0};
>> + int cpu;
>> +
>> + total_mt_num = min_t(unsigned int, total_mt_num,
>> + cpumask_weight(per_node_cpumask));
>> + total_mt_num = (total_mt_num / 2) * 2;
>> +
>> + work_items = kcalloc(total_mt_num, sizeof(struct copy_page_info),
>> + GFP_KERNEL);
>> + if (!work_items)
>> + return -ENOMEM;
>> +
>> + i = 0;
>> + for_each_cpu(cpu, per_node_cpumask) {
>> + if (i >= total_mt_num)
>> + break;
>> + cpu_id_list[i] = cpu;
>> + ++i;
>> + }
>> +
>> + vfrom = kmap(from);
>> + vto = kmap(to);
>> + chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
>
> Coding style ? Please run all these patches though scripts/
> checkpatch.pl script to catch coding style problems.
>
>> +
>> + for (i = 0; i < total_mt_num; ++i) {
>> + INIT_WORK((struct work_struct *)&work_items[i],
>> + copy_page_work_queue_thread);
>> +
>> + work_items[i].to = vto + i * chunk_size;
>> + work_items[i].from = vfrom + i * chunk_size;
>> + work_items[i].chunk_size = chunk_size;
>> +
>> + queue_work_on(cpu_id_list[i],
>> + system_highpri_wq,
>> + (struct work_struct *)&work_items[i]);
>
> I am not very familiar with the system work queues but is
> system_highpri_wq has the highest priority ? Because if
> the time spend waiting on these work queue functions to
> execute increases it can offset out all the benefits we
> get by this multi threaded copy.
According to include/linux/workqueue.h, system_highpri_wq has
high priority.
Another option is to create a dedicated workqueue for all
copy jobs.
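If we go that route, an untested sketch (the name is just a placeholder)
could be:

static struct workqueue_struct *copy_page_wq;

static int __init copy_page_wq_init(void)
{
	/* per-CPU and high priority, so queue_work_on() still pins each chunk */
	copy_page_wq = alloc_workqueue("copy_page", WQ_HIGHPRI, 0);

	return copy_page_wq ? 0 : -ENOMEM;
}
subsys_initcall(copy_page_wq_init);

and copy_page_mt() would queue its work items on copy_page_wq instead of
system_highpri_wq.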
--
Best Regards
Yan Zi
On 24 Nov 2016, at 3:15, Anshuman Khandual wrote:
> On 11/22/2016 09:55 PM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> From: Zi Yan <[email protected]>
>>
>> No functionality is changed.
>
> The commit message need to contains more details like it changes
> the enum declaration from numbers to bit positions, where all it
> changes existing code like compaction and migration.
>
Sure. I will add a more detailed description in the next version.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Signed-off-by: Zi Yan <[email protected]>
>
> Like last patch please fix the author details and signed offs.
>
Got it.
--
Best Regards
Yan Zi
On 11/28/2016 08:43 PM, Zi Yan wrote:
> On 24 Nov 2016, at 18:56, Balbir Singh wrote:
>
>> > On 23/11/16 03:25, Zi Yan wrote:
>>> >> From: Zi Yan <[email protected]>
>>> >>
>>> >> From: Zi Yan <[email protected]>
>>> >>
>>> >> migrate_page_copy() and copy_huge_page() are affected.
>>> >>
>>> >> Signed-off-by: Zi Yan <[email protected]>
>>> >> Signed-off-by: Zi Yan <[email protected]>
>>> >> ---
>>> >> fs/aio.c | 2 +-
>>> >> fs/hugetlbfs/inode.c | 2 +-
>>> >> fs/ubifs/file.c | 2 +-
>>> >> include/linux/migrate.h | 6 ++++--
>>> >> mm/migrate.c | 14 ++++++++------
>>> >> 5 files changed, 15 insertions(+), 11 deletions(-)
>>> >>
>>> >> diff --git a/fs/aio.c b/fs/aio.c
>>> >> index 428484f..a67c764 100644
>>> >> --- a/fs/aio.c
>>> >> +++ b/fs/aio.c
>>> >> @@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
>>> >> * events from being lost.
>>> >> */
>>> >> spin_lock_irqsave(&ctx->completion_lock, flags);
>>> >> - migrate_page_copy(new, old);
>>> >> + migrate_page_copy(new, old, 0);
>> >
>> > Can we have a useful enum instead of 0, its harder to read and understand
>> > 0
> How about MIGRATE_SINGLETHREAD = 0 ?
Right, there should be an enum declaration for all the kinds of page
copy we support. Right now we have just two, and we don't have to mention
the number of threads in the multi-threaded one:
MIGRATE_SINGLETHREAD
MIGRATE_MULTITHREAD
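As a rough sketch (naming still open for discussion):

enum migrate_copy_method {
	MIGRATE_SINGLETHREAD	= 0,
	MIGRATE_MULTITHREAD,
};

so that call sites read migrate_page_copy(new, old, MIGRATE_SINGLETHREAD)
instead of passing a bare 0.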
On 11/28/2016 08:33 PM, Zi Yan wrote:
> On 24 Nov 2016, at 4:26, Anshuman Khandual wrote:
>
>> On 11/22/2016 09:55 PM, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> From: Zi Yan <[email protected]>
>>
>> Please fix these.
>>
>>>
>>> Internally, copy_page_mt splits a page into multiple threads
>>> and send them as jobs to system_highpri_wq.
>>
>> The function should be renamed as copy_page_multithread() or at
>> the least copy_page_mthread() to make more sense. The commit
>> message needs to more comprehensive and detailed.
>>
>
> Sure.
>
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>> include/linux/highmem.h | 2 ++
>>> kernel/sysctl.c | 1 +
>>> mm/Makefile | 2 ++
>>> mm/copy_page.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> 4 files changed, 101 insertions(+)
>>> create mode 100644 mm/copy_page.c
>>>
>>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>>> index bb3f329..519e575 100644
>>> --- a/include/linux/highmem.h
>>> +++ b/include/linux/highmem.h
>>> @@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>>>
>>> #endif
>>>
>>> +int copy_page_mt(struct page *to, struct page *from, int nr_pages);
>>> +
>>> static inline void copy_highpage(struct page *to, struct page *from)
>>> {
>>> char *vfrom, *vto;
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index 706309f..d54ce12 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -97,6 +97,7 @@
>>>
>>> #if defined(CONFIG_SYSCTL)
>>>
>>> +
>>
>> I guess this is a stray code change.
>>
>>> /* External variables not in a header file. */
>>> extern int suid_dumpable;
>>> #ifdef CONFIG_COREDUMP
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index 295bd7a..467305b 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \
>>>
>>> obj-y += init-mm.o
>>>
>>> +obj-y += copy_page.o
>>
>> Its getting compiled all the time. Dont you want to make it part of
>> of a new config option which will cover for all these code for multi
>> thread copy ?
>
> I can do that.
>
>>> +
>>> ifdef CONFIG_NO_BOOTMEM
>>> obj-y += nobootmem.o
>>> else
>>> diff --git a/mm/copy_page.c b/mm/copy_page.c
>>> new file mode 100644
>>> index 0000000..ca7ce6c
>>> --- /dev/null
>>> +++ b/mm/copy_page.c
>>> @@ -0,0 +1,96 @@
>>> +/*
>>> + * Parallel page copy routine.
>>> + *
>>> + * Zi Yan <[email protected]>
>>> + *
>>> + */
>>
>> No, this is too less. Please see other files inside mm directory as
>> example.
>>
>
> Sure, I will add more description here.
>
>>> +
>>> +#include <linux/highmem.h>
>>> +#include <linux/workqueue.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/freezer.h>
>>> +
>>> +
>>> +const unsigned int limit_mt_num = 4;
>>
>> From where this number 4 came from ? At the very least it should be
>> configured from either a sysctl variable or from a sysfs file, so
>> that user will have control on number of threads used for copy. But
>> going forward this should be derived out a arch specific call back
>> which then analyzes NUMA topology and scheduler loads to figure out
>> on how many threads should be used for optimum performance of page
>> copy.
>
> I will expose this to sysctl.
When you do, if the user specifies a number, take it after some basic
sanity checks; but if the user specifies "default", then we should fall
back to an arch-specific callback to figure out how many threads should
be used.
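For the plain-number case, an entry along these lines in vm_table[] in
kernel/sysctl.c should be enough (untested; the knob name is just a
placeholder, and limit_mt_num would have to become a plain int instead
of a const):

static int limit_mt_min = 1;
static int limit_mt_max = NR_CPUS;

	{
		.procname	= "limit_mt_num",
		.data		= &limit_mt_num,
		.maxlen		= sizeof(limit_mt_num),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &limit_mt_min,
		.extra2		= &limit_mt_max,
	},

The "default" case would need a small custom handler on top of that.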
>
> For finding optimum performance, can we do a boot time microbenchmark
> to find the thread number?
Hmm, not sure that much measurement is required. I wonder whether any other
part of the kernel does this kind of thing? An arch callback should be able
to tell you based on already established facts. IMHO that should be
enough.
>
> For scheduler loads, can we traverse all online CPUs and use idle CPUs
> by checking idle_cpu()?
The arch callback should return a cpumask pointing out exactly which CPUs
should be used for the copy. The mask should be calculated
based on established facts about multi-threaded copy performance, NUMA affinity
(of the source and destination pages we are trying to migrate) and
scheduler run-queue state. But this function has to be fast enough
not to eat into the benefits of the multi-threaded copy. An idle_cpu()
check should be a very good starting point.
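As a starting point, an untested sketch (all names are placeholders):

int __weak arch_page_copy_cpumask(int dst_nid, unsigned int max_threads,
				  struct cpumask *mask)
{
	unsigned int picked = 0;
	int cpu;

	cpumask_clear(mask);

	/* prefer idle CPUs on the destination node, up to max_threads */
	for_each_cpu(cpu, cpumask_of_node(dst_nid)) {
		if (!idle_cpu(cpu))
			continue;
		cpumask_set_cpu(cpu, mask);
		if (++picked >= max_threads)
			break;
	}

	return picked;
}

An architecture can then override this with something that also considers
the source node and the core topology.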
>
>
>>> +
>>> +/* ======================== multi-threaded copy page ======================== */
>>> +
>>
>> Please use standard exported function description semantics while
>> describing this new function. I think its a good function to be
>> exported as a symbol as well.
>>
>>> +struct copy_page_info {
>>
>> s/copy_page_info/mthread_copy_struct/
>>
>>> + struct work_struct copy_page_work;
>>> + char *to;
>>> + char *from;
>>
>> Swap the order of 'to' and 'from'.
>>
>>> + unsigned long chunk_size;
>>
>> Just 'size' should be fine.
>>
>>> +};
>>> +
>>> +static void copy_page_routine(char *vto, char *vfrom,
>>
>> s/copy_page_routine/mthread_copy_fn/
>>
>>> + unsigned long chunk_size)
>>> +{
>>> + memcpy(vto, vfrom, chunk_size);
>>> +}
>>
>> s/chunk_size/size/
>>
>>
>
> Will do the suggested changes in the next version.
>
>>> +
>>> +static void copy_page_work_queue_thread(struct work_struct *work)
>>> +{
>>> + struct copy_page_info *my_work = (struct copy_page_info *)work;
>>> +
>>> + copy_page_routine(my_work->to,
>>> + my_work->from,
>>> + my_work->chunk_size);
>>> +}
>>> +
>>> +int copy_page_mt(struct page *to, struct page *from, int nr_pages)
>>> +{
>>> + unsigned int total_mt_num = limit_mt_num;
>>> + int to_node = page_to_nid(to);
>>
>> Should we make sure that the entire page range [to, to + nr_pages] is
>> part of to_node.
>>
>
> Currently, this is only used for huge pages, and nr_pages = hpage_nr_pages().
> This guarantees that the entire page range is in the same node.
In that case, WARN_ON() if either the source or the destination page is not huge.
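Something like the below (untested) at the top of copy_page_mt() would
catch a caller handing it base pages:

	WARN_ON(!PageCompound(from) || !PageCompound(to));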
>
>>> + int i;
>>> + struct copy_page_info *work_items;
>>> + char *vto, *vfrom;
>>> + unsigned long chunk_size;
>>> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>>
>> So all the threads used for copy has to be part of cpumask of the
>> destination node ? Why ? The copy accesses both the source pages as
>> well as destination pages. Source node threads might also perform
>> good for the memory accesses. Which and how many threads should be
>> used for copy should be decided wisely from an architecture call
>> back. On a NUMA system this will have impact on performance of the
>> multi threaded copy.
>
> This is based on my copy throughput benchmark results. The results
> show that moving data from the remote node to the local node (pulling)
> has higher throughput than moving data from the local node to the remote node (pushing).
>
> I got the same results on both Intel Xeon and IBM Power8, but
> it might not be the case for other machines.
>
> Ideally, we could do a boot-time benchmark to find the best configuration,
> i.e. whether to pull or push the data and how many threads to use. But for this
> patchset, I will probably choose pulling the data.
Right, but again it should be decided by the arch callback.
>
>
>>
>>
>>> + int cpu_id_list[32] = {0};
>>> + int cpu;
>>> +
>>> + total_mt_num = min_t(unsigned int, total_mt_num,
>>> + cpumask_weight(per_node_cpumask));
>>> + total_mt_num = (total_mt_num / 2) * 2;
>>> +
>>> + work_items = kcalloc(total_mt_num, sizeof(struct copy_page_info),
>>> + GFP_KERNEL);
>>> + if (!work_items)
>>> + return -ENOMEM;
>>> +
>>> + i = 0;
>>> + for_each_cpu(cpu, per_node_cpumask) {
>>> + if (i >= total_mt_num)
>>> + break;
>>> + cpu_id_list[i] = cpu;
>>> + ++i;
>>> + }
>>> +
>>> + vfrom = kmap(from);
>>> + vto = kmap(to);
>>> + chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
>>
>> Coding style ? Please run all these patches though scripts/
>> checkpatch.pl script to catch coding style problems.
>>
>>> +
>>> + for (i = 0; i < total_mt_num; ++i) {
>>> + INIT_WORK((struct work_struct *)&work_items[i],
>>> + copy_page_work_queue_thread);
>>> +
>>> + work_items[i].to = vto + i * chunk_size;
>>> + work_items[i].from = vfrom + i * chunk_size;
>>> + work_items[i].chunk_size = chunk_size;
>>> +
>>> + queue_work_on(cpu_id_list[i],
>>> + system_highpri_wq,
>>> + (struct work_struct *)&work_items[i]);
>>
>> I am not very familiar with the system work queues but is
>> system_highpri_wq has the highest priority ? Because if
>> the time spend waiting on these work queue functions to
>> execute increases it can offset out all the benefits we
>> get by this multi threaded copy.
>
> According to include/linux/workqueue.h, system_highpri_wq has
> high priority.
>
> Another option is to create a dedicated workqueue for all
> copy jobs.
Not sure, will have to check on this.