2015-08-03 16:26:07

by Vlastimil Babka

Subject: [RFC v3 1/2] mm, compaction: introduce kcompactd

v3: drop all changes to hugepages, just focus on kcompactd. Reworked
interactions with kswapd, no more periodic wakeups. Use
sysctl_extfrag_threshold for now. Loosely based on suggestions from Mel
Gorman and David Rientjes. Thanks.
Based on v4.2-rc4, only compile-tested. Will run some benchmarks, posting
now to keep discussions going and focus on kcompactd only.

Memory compaction can be currently performed in several contexts:

- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP page
fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc

The purpose of compaction is two-fold. The obvious purpose is to satisfy a
(pending or future) high-order allocation, and this is easy to evaluate. The
other purpose is to keep overall memory fragmentation low and help the
anti-fragmentation mechanism. Success wrt the latter purpose is more
difficult to evaluate, though.

The current situation wrt the purposes has a few drawbacks:

- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of keeping
memory fragmentation low.
- direct compaction increases the latency of allocations. Again, it would be
better if compaction was performed asynchronously to keep fragmentation low,
before the allocation itself arrives.
- (a special case of the previous) the cost of compaction during THP page
faults can easily offset the benefits of THP.

To improve the situation, we would benefit from an equivalent of kswapd, but
for compaction - i.e. a background thread which responds to fragmentation and
the need for high-order allocations (including hugepages) somewhat proactively.

One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It seems better to let kswapd handle
reclaim, as order-0 allocations are often more critical than high-order
ones.

Another possibility is to extend khugepaged, but this kthread is a single
instance and tied to THP configs.

This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new tunables.
The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.

Waking up of the kcompactd threads is also tied to kswapd activity and follows
these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from the
slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so don't
invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
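
Condensed from the diffs below (not a literal copy), the two wake-up paths
added by this patch are:

	/*
	 * 1) Allocation slowpath, wake_all_kswapds(): wakeup_kswapd() now
	 *    returns false only when the zone is already balanced, i.e.
	 *    kswapd has nothing to do - only then is kcompactd woken
	 *    directly.
	 */
	if (!wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone)))
		wakeup_kcompactd(zone->zone_pgdat, order);

	/*
	 * 2) kswapd_try_to_sleep(), after the node has been balanced: hand
	 *    the target order over to kcompactd for further work.
	 */
	wakeup_kcompactd(pgdat, order);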

The kswapd compact/reclaim loop for high-order pages is left alone for now
and precedes kcompactd wakeup, but this might be revisited later.

In this patch, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria, which means it will most likely have nothing left
to do after kswapd is finished. This is changed to rely on
sysctl_extfrag_threshold by the next patch, for review and discussion purposes.

Other possible future uses for kcompactd include the ability to wake it up
on demand in special situations, such as when hugepages are not available
(currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event
occurs (i.e. __rmqueue_fallback() is entered). It's also possible to perform
periodic compaction with kcompactd.

Not-yet-signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/compaction.h | 16 ++++
include/linux/mmzone.h | 7 +-
mm/compaction.c | 183 +++++++++++++++++++++++++++++++++++++++++++++
mm/memory_hotplug.c | 15 ++--
mm/page_alloc.c | 7 +-
mm/vmscan.c | 25 +++++--
6 files changed, 241 insertions(+), 12 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index aa8f61c..8cd1fb5 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -51,6 +51,10 @@ extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);

+extern int kcompactd_run(int nid);
+extern void kcompactd_stop(int nid);
+extern void wakeup_kcompactd(pg_data_t *pgdat, int order);
+
#else
static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
@@ -83,6 +87,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}

+static inline int kcompactd_run(int nid)
+{
+ return 0;
+}
+static inline void kcompactd_stop(int nid)
+{
+}
+
+static inline void wakeup_kcompactd(pg_data_t *pgdat, int order)
+{
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c259..423e88e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -752,6 +752,11 @@ typedef struct pglist_data {
mem_hotplug_begin/end() */
int kswapd_max_order;
enum zone_type classzone_idx;
+#ifdef CONFIG_COMPACTION
+ int kcompactd_max_order;
+ wait_queue_head_t kcompactd_wait;
+ struct task_struct *kcompactd;
+#endif
#ifdef CONFIG_NUMA_BALANCING
/* Lock serializing the migrate rate limiting window */
spinlock_t numabalancing_migrate_lock;
@@ -798,7 +803,7 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)

extern struct mutex zonelists_mutex;
void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
-void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
+bool wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags);
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index 018f08d..b051412 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -17,6 +17,9 @@
#include <linux/balloon_compaction.h>
#include <linux/page-isolation.h>
#include <linux/kasan.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
#include "internal.h"

#ifdef CONFIG_COMPACTION
@@ -29,6 +32,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
{
count_vm_events(item, delta);
}
+
#else
#define count_compact_event(item) do { } while (0)
#define count_compact_events(item, delta) do { } while (0)
@@ -1714,4 +1718,183 @@ void compaction_unregister_node(struct node *node)
}
#endif /* CONFIG_SYSFS && CONFIG_NUMA */

+static bool kcompactd_work_requested(pg_data_t *pgdat)
+{
+ return pgdat->kcompactd_max_order > 0;
+}
+
+static bool kcompactd_node_suitable(pg_data_t *pgdat, int order)
+{
+ int zoneid;
+ struct zone *zone;
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+ zone = &pgdat->node_zones[zoneid];
+
+ if (compaction_suitable(zone, order, 0, zoneid) ==
+ COMPACT_CONTINUE)
+ return true;
+ }
+
+ return false;
+}
+
+static void kcompactd_do_work(pg_data_t *pgdat)
+{
+ /*
+ * With no special task, compact all zones so that a page of requested
+ * order is allocatable.
+ */
+ int zoneid;
+ struct zone *zone;
+ struct compact_control cc = {
+ .order = pgdat->kcompactd_max_order,
+ .mode = MIGRATE_SYNC_LIGHT,
+ //TODO: do this or not?
+ .ignore_skip_hint = true,
+ };
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+ COMPACT_CONTINUE)
+ continue;
+
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ if (zone_watermark_ok(zone, cc.order,
+ low_wmark_pages(zone), 0, 0))
+ compaction_defer_reset(zone, cc.order, false);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+
+ /* Regardless of success, we are done until woken up next */
+ pgdat->kcompactd_max_order = 0;
+}
+
+void wakeup_kcompactd(pg_data_t *pgdat, int order)
+{
+ if (pgdat->kcompactd_max_order < order)
+ pgdat->kcompactd_max_order = order;
+
+ if (!waitqueue_active(&pgdat->kcompactd_wait))
+ return;
+
+ if (!kcompactd_node_suitable(pgdat, order))
+ return;
+
+ wake_up_interruptible(&pgdat->kcompactd_wait);
+}
+
+/*
+ * The background compaction daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kcompactd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+
+ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(tsk, cpumask);
+
+ set_freezable();
+
+ while (!kthread_should_stop()) {
+ wait_event_freezable(pgdat->kcompactd_wait,
+ kcompactd_work_requested(pgdat));
+
+ kcompactd_do_work(pgdat);
+ }
+
+ return 0;
+}
+
+/*
+ * This kcompactd start function will be called by init and node-hot-add.
+ * On node-hot-add, kcompactd will be moved to the proper cpus if cpus are hot-added.
+ */
+int kcompactd_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret = 0;
+
+ if (pgdat->kcompactd)
+ return 0;
+
+ pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
+ if (IS_ERR(pgdat->kcompactd)) {
+ pr_err("Failed to start kcompactd on node %d\n", nid);
+ ret = PTR_ERR(pgdat->kcompactd);
+ pgdat->kcompactd = NULL;
+ }
+ return ret;
+}
+
+/*
+ * Called by memory hotplug when all memory in a node is offlined. Caller must
+ * hold mem_hotplug_begin/end().
+ */
+void kcompactd_stop(int nid)
+{
+ struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
+
+ if (kcompactd) {
+ kthread_stop(kcompactd);
+ NODE_DATA(nid)->kcompactd = NULL;
+ }
+}
+
+/*
+ * It's optimal to keep kcompactd on the same CPUs as their memory, but
+ * not required for correctness. So if the last cpu in a node goes
+ * away, we get changed to run anywhere: as the first one comes back,
+ * restore their cpu bindings.
+ */
+static int cpu_callback(struct notifier_block *nfb, unsigned long action,
+ void *hcpu)
+{
+ int nid;
+
+ if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ const struct cpumask *mask;
+
+ mask = cpumask_of_node(pgdat->node_id);
+
+ if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
+ /* One of our CPUs online: restore mask */
+ set_cpus_allowed_ptr(pgdat->kcompactd, mask);
+ }
+ }
+ return NOTIFY_OK;
+}
+
+static int __init kcompactd_init(void)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY)
+ kcompactd_run(nid);
+ hotcpu_notifier(cpu_callback, 0);
+ return 0;
+}
+
+module_init(kcompactd_init)
+
#endif /* CONFIG_COMPACTION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 26fbba7..b2c695d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -32,6 +32,7 @@
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/bootmem.h>
+#include <linux/compaction.h>

#include <asm/tlbflush.h>

@@ -1001,7 +1002,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);

- nid = pfn_to_nid(pfn);
+ nid = zone_to_nid(zone);

ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
@@ -1041,7 +1042,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
pgdat_resize_unlock(zone->zone_pgdat, &flags);

if (onlined_pages) {
- node_states_set_node(zone_to_nid(zone), &arg);
+ node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL, NULL);
else
@@ -1052,8 +1053,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ

init_per_zone_wmark_min();

- if (onlined_pages)
- kswapd_run(zone_to_nid(zone));
+ if (onlined_pages) {
+ kswapd_run(nid);
+ kcompactd_run(nid);
+ }

vm_total_pages = nr_free_pagecache_pages();

@@ -1783,8 +1786,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
zone_pcp_update(zone);

node_states_clear_node(node, &arg);
- if (arg.status_change_nid >= 0)
+ if (arg.status_change_nid >= 0) {
kswapd_stop(node);
+ kcompactd_stop(node);
+ }

vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22..ae3e795 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1,4 +1,5 @@
/*
+ *
* linux/mm/page_alloc.c
*
* Manages the free list, the system allocates free pages here.
@@ -2894,7 +2895,8 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)

for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->high_zoneidx, ac->nodemask)
- wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+ if (!wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone)))
+ wakeup_kcompactd(zone->zone_pgdat, order);
}

static inline int
@@ -5293,6 +5295,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+#ifdef CONFIG_COMPACTION
+ init_waitqueue_head(&pgdat->kcompactd_wait);
+#endif
pgdat_page_ext_init(pgdat);

for (j = 0; j < MAX_NR_ZONES; j++) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e61445d..075f53c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3360,6 +3360,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
*/
reset_isolation_suitable(pgdat);

+ /*
+ * We have balanced the zone, but kcompactd might want to
+ * further reduce the fragmentation.
+ */
+ wakeup_kcompactd(pgdat, order);
+
if (!kthread_should_stop())
schedule();

@@ -3484,28 +3490,37 @@ static int kswapd(void *p)

/*
* A zone is low on free memory, so wake its kswapd task to service it.
+ *
+ * Returns false when wakeup was skipped because zone was already balanced.
+ * Returns true when wakeup was either done or skipped for other reasons.
+ *
+ * This is to decide when to try waking up kcompactd, which should be done
+ * only when kswapd is not running. Kcompactd may decide to perform more work
+ * than what satisfies zone_balanced().
*/
-void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
+bool wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
pg_data_t *pgdat;

if (!populated_zone(zone))
- return;
+ return true;

if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
- return;
+ return true;
pgdat = zone->zone_pgdat;
if (pgdat->kswapd_max_order < order) {
pgdat->kswapd_max_order = order;
pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
}
if (!waitqueue_active(&pgdat->kswapd_wait))
- return;
+ return true;
if (zone_balanced(zone, order, 0, 0))
- return;
+ return false;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
wake_up_interruptible(&pgdat->kswapd_wait);
+
+ return true;
}

#ifdef CONFIG_HIBERNATION
--
2.4.6


2015-08-03 16:26:24

by Vlastimil Babka

Subject: [RFC v3 2/2] mm, compaction: make kcompactd rely on sysctl_extfrag_threshold

The previous patch introduced kcompactd kthreads which are meant to keep
memory fragmentation lower than what kswapd achieves through its
reclaim/compaction activity. In order to do that, it needs stricter criteria
for determining when to start/stop compacting than the standard criteria,
which try to satisfy just the next single high-order allocation request. This
patch provides such criteria with minimal changes and no new tunables.

This patch uses the existing sysctl_extfrag_threshold tunable. This tunable
currently determines when direct compaction should stop trying to satisfy an
allocation - that happens when a page of the desired order has not been made
available, but the fragmentation index has already dropped below the given
threshold, so we expect further compaction to be too costly and possibly fail
anyway.

For kcompactd, we simply ignore whether the page has become available, and
continue compacting until the fragmentation index drops below the threshold
(or the whole zone is scanned).
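
For reference, the index compared against the threshold comes from
__fragmentation_index() in mm/vmstat.c. Paraphrased (and including the
ignore_suitable parameter added by this patch), it is computed roughly as
follows; values range from 0 (a failed allocation would be due to lack of
memory) to 1000 (it would be due to fragmentation), and the threshold is set
via /proc/sys/vm/extfrag_threshold (default 500):

	unsigned long requested = 1UL << order;

	if (!info->free_blocks_total)
		return 0;

	/*
	 * Normally a suitably large free block short-circuits to -1000
	 * ("no compaction needed"); kcompactd passes ignore_suitable=true
	 * and keeps compacting regardless.
	 */
	if (!ignore_suitable && info->free_blocks_suitable)
		return -1000;

	return 1000 - div_u64(1000 + div_u64(info->free_pages * 1000ULL,
					requested),
			info->free_blocks_total);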

Not-yet-signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/compaction.h | 7 ++++---
mm/compaction.c | 37 ++++++++++++++++++++++++++-----------
mm/internal.h | 1 +
mm/vmscan.c | 10 +++++-----
mm/vmstat.c | 12 +++++++-----
5 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 8cd1fb5..c615465 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -36,14 +36,15 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
extern int sysctl_compact_unevictable_allowed;

-extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern int fragmentation_index(struct zone *zone, unsigned int order,
+ bool ignore_suitable);
extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx);
+ int alloc_flags, int classzone_idx, bool kcompactd);

extern void defer_compaction(struct zone *zone, int order);
extern bool compaction_deferred(struct zone *zone, int order);
@@ -73,7 +74,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}

static inline unsigned long compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx)
+ int alloc_flags, int classzone_idx, bool kcompactd)
{
return COMPACT_SKIPPED;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index b051412..62b9e51 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1183,6 +1183,19 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
cc->alloc_flags))
return COMPACT_CONTINUE;

+ if (cc->kcompactd) {
+ /*
+ * kcompactd continues even if watermarks are met, until the
+ * fragmentation index is so low that direct compaction
+ * wouldn't be attempted
+ */
+ int fragindex = fragmentation_index(zone, cc->order, true);
+ if (fragindex <= sysctl_extfrag_threshold)
+ return COMPACT_NOT_SUITABLE_ZONE;
+ else
+ return COMPACT_CONTINUE;
+ }
+
/* Direct compactor: Is a suitable page free? */
for (order = cc->order; order < MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[order];
@@ -1231,7 +1244,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
* COMPACT_CONTINUE - If compaction should run now
*/
static unsigned long __compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx)
+ int alloc_flags, int classzone_idx, bool kcompactd)
{
int fragindex;
unsigned long watermark;
@@ -1246,10 +1259,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
watermark = low_wmark_pages(zone);
/*
* If watermarks for high-order allocation are already met, there
- * should be no need for compaction at all.
+ * should be no need for compaction at all, unless it's kcompactd.
*/
- if (zone_watermark_ok(zone, order, watermark, classzone_idx,
- alloc_flags))
+ if (!kcompactd && zone_watermark_ok(zone, order, watermark,
+ classzone_idx, alloc_flags))
return COMPACT_PARTIAL;

/*
@@ -1272,7 +1285,7 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
*
* Only compact if a failure would be due to fragmentation.
*/
- fragindex = fragmentation_index(zone, order);
+ fragindex = fragmentation_index(zone, order, kcompactd);
if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
return COMPACT_NOT_SUITABLE_ZONE;

@@ -1280,11 +1293,12 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
}

unsigned long compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx)
+ int alloc_flags, int classzone_idx, bool kcompactd)
{
unsigned long ret;

- ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
+ ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
+ kcompactd);
trace_mm_compaction_suitable(zone, order, ret);
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED;
@@ -1302,7 +1316,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
unsigned long last_migrated_pfn = 0;

ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
- cc->classzone_idx);
+ cc->classzone_idx, cc->kcompactd);
switch (ret) {
case COMPACT_PARTIAL:
case COMPACT_SKIPPED:
@@ -1731,8 +1745,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat, int order)
for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
zone = &pgdat->node_zones[zoneid];

- if (compaction_suitable(zone, order, 0, zoneid) ==
- COMPACT_CONTINUE)
+ if (compaction_suitable(zone, order, 0, zoneid, true) ==
+ COMPACT_CONTINUE)
return true;
}

@@ -1750,6 +1764,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
struct compact_control cc = {
.order = pgdat->kcompactd_max_order,
.mode = MIGRATE_SYNC_LIGHT,
+ .kcompactd = true,
//TODO: do this or not?
.ignore_skip_hint = true,
};
@@ -1760,7 +1775,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (!populated_zone(zone))
continue;

- if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+ if (compaction_suitable(zone, cc.order, 0, zoneid, true) !=
COMPACT_CONTINUE)
continue;

diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1..2cea51a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -184,6 +184,7 @@ struct compact_control {
unsigned long migrate_pfn; /* isolate_migratepages search base */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
+ bool kcompactd; /* We are in kcompactd kthread */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 075f53c..f6582b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2339,7 +2339,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
return true;

/* If compaction would go ahead or the allocation would succeed, stop */
- switch (compaction_suitable(zone, sc->order, 0, 0)) {
+ switch (compaction_suitable(zone, sc->order, 0, 0, false)) {
case COMPACT_PARTIAL:
case COMPACT_CONTINUE:
return false;
@@ -2467,7 +2467,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
* If compaction is not ready to start and allocation is not likely
* to succeed without it, then keep reclaiming.
*/
- if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
+ if (compaction_suitable(zone, order, 0, 0, false) == COMPACT_SKIPPED)
return false;

return watermark_ok;
@@ -2941,7 +2941,7 @@ static bool zone_balanced(struct zone *zone, int order,
return false;

if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
- order, 0, classzone_idx) == COMPACT_SKIPPED)
+ order, 0, classzone_idx, false) == COMPACT_SKIPPED)
return false;

return true;
@@ -3065,8 +3065,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
* from memory. Do not reclaim more than needed for compaction.
*/
if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
- compaction_suitable(zone, sc->order, 0, classzone_idx)
- != COMPACT_SKIPPED)
+ compaction_suitable(zone, sc->order, 0, classzone_idx,
+ false) != COMPACT_SKIPPED)
testorder = 0;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..9916110 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -643,7 +643,8 @@ static void fill_contig_page_info(struct zone *zone,
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
+static int __fragmentation_index(unsigned int order,
+ struct contig_page_info *info, bool ignore_suitable)
{
unsigned long requested = 1UL << order;

@@ -651,7 +652,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
return 0;

/* Fragmentation index only makes sense when a request would fail */
- if (info->free_blocks_suitable)
+ if (!ignore_suitable && info->free_blocks_suitable)
return -1000;

/*
@@ -664,12 +665,13 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
}

/* Same as __fragmentation index but allocs contig_page_info on stack */
-int fragmentation_index(struct zone *zone, unsigned int order)
+int fragmentation_index(struct zone *zone, unsigned int order,
+ bool ignore_suitable)
{
struct contig_page_info info;

fill_contig_page_info(zone, order, &info);
- return __fragmentation_index(order, &info);
+ return __fragmentation_index(order, &info, ignore_suitable);
}
#endif

@@ -1635,7 +1637,7 @@ static void extfrag_show_print(struct seq_file *m,
zone->name);
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- index = __fragmentation_index(order, &info);
+ index = __fragmentation_index(order, &info, false);
seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
}

--
2.4.6

2015-08-10 09:44:22

by Vlastimil Babka

Subject: Re: [RFC v3 1/2] mm, compaction: introduce kcompactd

On 08/09/2015 05:37 PM, PINTU KUMAR wrote:
>> Waking up of the kcompactd threads is also tied to kswapd activity and follows
>> these rules:
>> - we don't want to affect any fastpaths, so wake up kcompactd only from the
>> slowpath, as it's done for kswapd
>> - if kswapd is doing reclaim, it's more important than compaction, so
>> don't
>> invoke kcompactd until kswapd goes to sleep
>> - the target order used for kswapd is passed to kcompactd
>>
>> The kswapd compact/reclaim loop for high-order pages is left alone for now
>> and precedes kcompactd wakeup, but this might be revisited later.
>
> kcompactd, will be really nice thing to have, but I oppose calling it from kswapd.
> Because, just after kswapd, we already have direct_compact.

Just to be clear, here you mean that kswapd already does the
compact/reclaim loop?

> So it may end up in doing compaction 2 times.

The compact/reclaim loop might already do multiple iterations. The point
is, kswapd will terminate the loop as soon as a single page of the desired
order becomes available. Kcompactd is meant to go beyond that.
And having kcompactd run in parallel with kswapd's reclaim looks like
nonsense to me, so I don't see any other way than to have kswapd wake up
kcompactd when it's finished.

> Or, is it like, with kcompactd, we dont need direct_compact?

That will have to be evaluated. It would be nice to not need the
compact/reclaim loop, but I'm not sure it's always possible. We could
move it to kcompactd, but it would still mean that no daemon does
exclusively just reclaim or just compaction.

> In embedded world situation is really worse.
> As per my experience in embedded world, just compaction does not help always in longer run.
>
> As I know there are already some Android model in market, that already run background compaction (from user space).
> But still there are sluggishness issues due to bad memory state in the long run.

It should still be better with background compaction than without it. Of
course, completely avoiding permanent fragmentation cannot be guaranteed, as
it depends on the allocation patterns.

> In embedded world, the major problems are related to camera and browser use cases that requires almost order-8 allocations.
> Also, for low RAM configurations (less than 512M, 256M etc.), the rate of failure of compaction is much higher than the rate of success.

I was under impression that CMA was introduced to deal with such
high-order requirements in the embedded world?

> How can we guarantee that kcompactd is suitable for all situations?

We can't :) we can only hope to improve the average case. Anything that
needs high-order *guarantees* has to rely on CMA or another kind of
reservation (yeah even CMA is a pageblock reservation in some sense).

> In an case, we need large amount of testing to cover all scenarios.
> It should be called at the right time.
> I dont have any data to present right now.
> May be I will try to capture some data, and present here.

That would be nice. I'm going to collect some as well.

2015-08-10 09:54:26

by Vlastimil Babka

Subject: Re: [RFC v3 2/2] mm, compaction: make kcompactd rely on sysctl_extfrag_threshold

On 08/09/2015 07:21 PM, PINTU KUMAR wrote:
>>
>> -extern int fragmentation_index(struct zone *zone, unsigned int order);
>> +extern int fragmentation_index(struct zone *zone, unsigned int order,
>
>> + bool ignore_suitable);
>
> We would like to retain the original fragmentation_index as it is.
> Because in some cases people may be using it without kcompactd.
> In such cases, future kernel upgrades will suffer.
> In my opinion fragmentation_index should work just based on zones and order.

I don't understand the concern. If you pass 'false' to ignore_suitable,
you get the standard behavior. Only kcompactd uses the altered behavior.
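
For illustration, these are the two forms of the call in the patch:

	/* standard behaviour (direct compaction, vmstat): */
	fragindex = fragmentation_index(zone, order, false);

	/* kcompactd: keep compacting even if a suitable free block exists */
	fragindex = fragmentation_index(zone, cc->order, true);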

2015-08-11 08:51:20

by PINTU KUMAR

Subject: RE: [RFC v3 1/2] mm, compaction: introduce kcompactd

Hi,

> -----Original Message-----
> From: Vlastimil Babka [mailto:[email protected]]
> Sent: Monday, August 10, 2015 3:14 PM
> To: PINTU KUMAR; [email protected]
> Cc: [email protected]; Andrew Morton; Hugh Dickins; Andrea
> Arcangeli; Kirill A. Shutemov; Rik van Riel; Mel Gorman; David Rientjes; Joonsoo
> Kim; Pintu Kumar
> Subject: Re: [RFC v3 1/2] mm, compaction: introduce kcompactd
>
> On 08/09/2015 05:37 PM, PINTU KUMAR wrote:
> >> Waking up of the kcompactd threads is also tied to kswapd activity
> >> and follows these rules:
> >> - we don't want to affect any fastpaths, so wake up kcompactd only from the
> >> slowpath, as it's done for kswapd
> >> - if kswapd is doing reclaim, it's more important than compaction, so
> >> don't
> >> invoke kcompactd until kswapd goes to sleep
> >> - the target order used for kswapd is passed to kcompactd
> >>
> >> The kswapd compact/reclaim loop for high-order pages is left alone
> >> for now and precedes kcompactd wakeup, but this might be revisited later.
> >
> > kcompactd, will be really nice thing to have, but I oppose calling it from
> > kswapd.
> > Because, just after kswapd, we already have direct_compact.
>
> Just to be clear, here you mean that kswapd already does the compact/reclaim
> loop?
>
No, I mean that in the slowpath, after kswapd, there is already direct compaction/reclaim.

> > So it may end up in doing compaction 2 times.
>
> The compact/reclaim loop might already do multiple iterations. The point is,
> kswapd will terminate the loop as soon as single page of desired order becomes
> available. Kcompactd is meant to go beyond that.
> And having kcompactd run in parallel with kswapd's reclaim looks like nonsense
> to me, so I don't see other way than have kswapd wake up kcompactd when it's
> finished.
>
But if kswapd is disabled, then even kcompactd will not be called, and we are in the same situation.
Just a thought: how about creating a kworker thread for performing kcompactd?
Maybe schedule it on demand (based on the current fragmentation level at COSTLY_ORDER) from other sub-systems.
Or maybe invoke it when direct reclaim fails.
Because, as per my observation, running compaction immediately after reclaim gives more benefit.
How about tracking all higher-order allocations in the kernel to understand who actually needs them?

> > Or, is it like, with kcompactd, we dont need direct_compact?
>
> That will have to be evaluated. It would be nice to not need the compact/reclaim
> loop, but I'm not sure it's always possible. We could move it to kcompactd, but it
> would still mean that no daemon does exclusively just reclaim or just
> compaction.
>
> > In embedded world situation is really worse.
> > As per my experience in embedded world, just compaction does not help
> > always in longer run.
> >
> > As I know there are already some Android model in market, that already run
> > background compaction (from user space).
> > But still there are sluggishness issues due to bad memory state in the long run.
>
> It should still be better with background compaction than without it. Of course,
> avoiding a permanent fragmentation completely is not possible to guarantee as it
> depends on the allocation patterns.
>
> > In embedded world, the major problems are related to camera and browser use
> > cases that requires almost order-8 allocations.
> > Also, for low RAM configurations (less than 512M, 256M etc.), the rate of
> > failure of compaction is much higher than the rate of success.
>
> I was under impression that CMA was introduced to deal with such high-order
> requirements in the embedded world?
>
CMA has its own limitations and drawbacks (because of the movable-pages criteria).
Please check this:
https://lkml.org/lkml/2014/5/7/810
So, for low-RAM devices we try to make CMA as tight and small as possible.
For IOMMU-supported devices (camera etc.), we don't need CMA.
In the Android case, they use the ION system heap, which relies on higher-order allocations (with a fallback mechanism) and then performs scatter/gather.
For more information, please check:
drivers/staging/android/ion/ion_system_heap.c

> > How can we guarantee that kcompactd is suitable for all situations?
>
> We can't :) we can only hope to improve the average case. Anything that needs
> high-order *guarantees* has to rely on CMA or another kind of reservation (yeah
> even CMA is a pageblock reservation in some sense).
>
> > In an case, we need large amount of testing to cover all scenarios.
> > It should be called at the right time.
> > I dont have any data to present right now.
> > May be I will try to capture some data, and present here.
>
> That would be nice. I'm going to collect some as well.

Especially, I would like to see the results on low RAM (less than 512M).
I will also share if I get anything interesting.
Thanks.