2015-07-02 08:47:33

by Vlastimil Babka

Subject: [RFC v2 0/4] Outsourcing compaction for THP allocations to kcompactd

This RFC series is another evolution of the attempt to deal with THP
allocation latencies. Please see the motivation in the previous version [1].

The main difference here is that I've bitten the bullet and implemented
per-node kcompactd kthreads - see Patch 1 for the details of why and how.
Trying to fit everything into khugepaged was getting too clumsy, and kcompactd
could have more benefits; see e.g. the ideas here [2]. Not everything is
implemented yet, though, and I would welcome some feedback first.

The devil will of course be in the details, i.e. how to steer the kcompactd
activity. Ideally it should somehow take into account the amount of free
memory, its fragmentation, the pressure for high-order allocations (including
hugepages), past compaction successes/failures, the CPU time spent... not an
easy task.
Suggestions welcome :)
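To make the discussion concrete, here is one purely hypothetical shape such
steering could take: a feedback loop on the sleep interval that backs off while
compaction passes are failing or unneeded, and becomes responsive again once a
pass succeeds under high-order pressure. The function name and constants below
are invented for illustration only and are not part of this patchset:

```c
#include <assert.h>

/*
 * Illustrative only: back off exponentially up to a maximum while there is
 * nothing useful to do; return to the minimum interval as soon as a
 * compaction pass succeeds while high-order pressure exists.
 */
#define KCOMPACTD_MIN_SLEEP_MS   1000
#define KCOMPACTD_MAX_SLEEP_MS  60000

static unsigned int kcompactd_next_sleep_ms(unsigned int cur_ms,
					    int pass_succeeded,
					    int high_order_pressure)
{
	if (pass_succeeded && high_order_pressure)
		return KCOMPACTD_MIN_SLEEP_MS;	/* stay responsive */
	if (cur_ms >= KCOMPACTD_MAX_SLEEP_MS / 2)
		return KCOMPACTD_MAX_SLEEP_MS;	/* clamp */
	return cur_ms * 2;			/* back off */
}
```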

I briefly tested it with mmtests/thpscale, but I don't think the results are
that interesting at this moment.

The patchset is based on v4.1, next would probably conflict in at least
mempolicy.c. I know it's still merge window, but didn't want to delay 2 weeks
due to upcoming vacation. Thanks.

[1] https://lwn.net/Articles/643891/
[2] http://article.gmane.org/gmane.linux.kernel/1982369

Vlastimil Babka (4):
mm, compaction: introduce kcompactd
mm, thp: stop preallocating hugepages in khugepaged
mm, thp: check for hugepage availability in khugepaged
mm, thp: check hugepage availability for fault allocations

include/linux/compaction.h | 13 +++
include/linux/mmzone.h | 8 ++
mm/compaction.c | 207 +++++++++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 180 +++++++++++++++++++--------------------
mm/internal.h | 39 +++++++++
mm/memory_hotplug.c | 15 ++--
mm/mempolicy.c | 42 +++++----
mm/page_alloc.c | 6 ++
mm/vmscan.c | 7 ++
9 files changed, 403 insertions(+), 114 deletions(-)

--
2.4.3


2015-07-02 08:47:51

by Vlastimil Babka

Subject: [RFC 1/4] mm, compaction: introduce kcompactd

Memory compaction can be currently performed in several contexts:

- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP page
fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc

The purpose of compaction is two-fold. The obvious purpose is to satisfy a
(pending or future) high-order allocation, and is easy to evaluate. The other
purpose is to keep overall memory fragmentation low and help the
anti-fragmentation mechanism. The success wrt the latter purpose is more
difficult to evaluate.

The current situation wrt the purposes has a few drawbacks:

- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of keeping
memory fragmentation low.
- direct compaction increases the latency of allocations. Again, it would be
better if compaction were performed asynchronously to keep fragmentation low,
before the allocation attempt itself.
- (a special case of the previous) the cost of compaction during THP page
faults can easily offset the benefits of THP.

To improve the situation, we need an equivalent of kswapd, but for compaction:
a background thread which responds somewhat proactively to fragmentation and
the need for high-order allocations (including hugepages).

One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let kswapd
handle reclaim, as order-0 allocations are often more critical than high-order
ones.

Another possibility is to extend khugepaged, but this kthread is a single
instance and tied to THP configs.

This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations. The lifecycle mimics kswapd kthreads.

The work loop of kcompactd currently mimics a pageblock-order direct
compaction attempt every 15 seconds. This might not be enough to keep
fragmentation low, and needs evaluation.

When there's not enough free memory for compaction, kswapd is woken up for
reclaim only (not combined reclaim/compaction).

Further patches will add the ability to wake up kcompactd on demand in special
situations such as when hugepages are not available, or when a fragmentation
event occurred.
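For readers skimming the diff below, the per-zone decision made by the work
loop can be summarized in a small compilable sketch. The enum values and
function name here are simplified stand-ins for the kernel's
compaction_suitable()/compact_control machinery, not the actual interfaces:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's compaction_suitable() results. */
enum compact_result { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_CONTINUE };

enum zone_action { WAKE_KSWAPD, SKIP_ZONE, COMPACT_ZONE };

/* Model of the per-zone branch in kcompactd_do_work(). */
static enum zone_action kcompactd_zone_action(enum compact_result suitable)
{
	if (suitable == COMPACT_SKIPPED)
		return WAKE_KSWAPD;	/* too little free memory: reclaim first */
	if (suitable == COMPACT_PARTIAL)
		return SKIP_ZONE;	/* pageblock-order page already allocatable */
	return COMPACT_ZONE;		/* run sync compaction on this zone */
}
```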

Not-yet-signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/compaction.h | 11 +++
include/linux/mmzone.h | 4 ++
mm/compaction.c | 173 +++++++++++++++++++++++++++++++++++++++++++++
mm/memory_hotplug.c | 15 ++--
mm/page_alloc.c | 3 +
5 files changed, 201 insertions(+), 5 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index aa8f61c..a2525d8 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -51,6 +51,9 @@ extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);

+extern int kcompactd_run(int nid);
+extern void kcompactd_stop(int nid);
+
#else
static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
@@ -83,6 +86,14 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}

+static inline int kcompactd_run(int nid)
+{
+ return 0;
+}
+static inline void kcompactd_stop(int nid)
+{
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..bc96a23 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -762,6 +762,10 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+#ifdef CONFIG_COMPACTION
+ struct task_struct *kcompactd;
+ wait_queue_head_t kcompactd_wait;
+#endif
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/compaction.c b/mm/compaction.c
index 018f08d..fcbc093 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -17,6 +17,8 @@
#include <linux/balloon_compaction.h>
#include <linux/page-isolation.h>
#include <linux/kasan.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include "internal.h"

#ifdef CONFIG_COMPACTION
@@ -29,6 +31,10 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
{
count_vm_events(item, delta);
}
+
+//TODO: add tuning knob
+static unsigned int kcompactd_sleep_millisecs __read_mostly = 15000;
+
#else
#define count_compact_event(item) do { } while (0)
#define count_compact_events(item, delta) do { } while (0)
@@ -1714,4 +1720,171 @@ void compaction_unregister_node(struct node *node)
}
#endif /* CONFIG_SYSFS && CONFIG_NUMA */

+/*
+ * Has any special work been requested of kcompactd?
+ */
+static bool kcompactd_work_requested(pg_data_t *pgdat)
+{
+ return false;
+}
+
+static void kcompactd_do_work(pg_data_t *pgdat)
+{
+ /*
+ * //TODO: smarter decisions on how much to compact. Using pageblock
+ * order might result in no compaction, until fragmentation builds up
+ * too much. Using order -1 could be too aggressive on large zones.
+ *
+ * With no special task, compact all zones so that a pageblock-order
+ * page is allocatable. Wake up kswapd if there's not enough free
+ * memory for compaction.
+ */
+ int zoneid;
+ struct zone *zone;
+ struct compact_control cc = {
+ .order = pageblock_order,
+ .mode = MIGRATE_SYNC,
+ .ignore_skip_hint = true,
+ };
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+ int suitable;
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ suitable = compaction_suitable(zone, cc.order, 0, 0);
+
+ if (suitable == COMPACT_SKIPPED) {
+ /*
+ * We pass order==0 to kswapd so it doesn't compact by
+ * itself. We just need enough free pages to proceed
+ * with compaction here on next kcompactd wakeup.
+ */
+ wakeup_kswapd(zone, 0, 0);
+ continue;
+ }
+ if (suitable == COMPACT_PARTIAL)
+ continue;
+
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ if (zone_watermark_ok(zone, cc.order,
+ low_wmark_pages(zone), 0, 0))
+ compaction_defer_reset(zone, cc.order, false);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+
+}
+
+/*
+ * The background compaction daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kcompactd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+
+ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(tsk, cpumask);
+
+ set_freezable();
+
+ while (!kthread_should_stop()) {
+ kcompactd_do_work(pgdat);
+
+ wait_event_freezable_timeout(pgdat->kcompactd_wait,
+ kcompactd_work_requested(pgdat),
+ msecs_to_jiffies(kcompactd_sleep_millisecs));
+ }
+
+ return 0;
+}
+
+/*
+ * This kcompactd start function will be called by init and node-hot-add.
+ * On node-hot-add, kcompactd will be moved to the proper cpus if cpus are hot-added.
+ */
+int kcompactd_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret = 0;
+
+ if (pgdat->kcompactd)
+ return 0;
+
+ pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
+ if (IS_ERR(pgdat->kcompactd)) {
+ pr_err("Failed to start kcompactd on node %d\n", nid);
+ ret = PTR_ERR(pgdat->kcompactd);
+ pgdat->kcompactd = NULL;
+ }
+ return ret;
+}
+
+/*
+ * Called by memory hotplug when all memory in a node is offlined. Caller must
+ * hold mem_hotplug_begin/end().
+ */
+void kcompactd_stop(int nid)
+{
+ struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
+
+ if (kcompactd) {
+ kthread_stop(kcompactd);
+ NODE_DATA(nid)->kcompactd = NULL;
+ }
+}
+
+/*
+ * It's optimal to keep kcompactd on the same CPUs as their memory, but
+ * not required for correctness. So if the last cpu in a node goes
+ * away, we get changed to run anywhere: as the first one comes back,
+ * restore their cpu bindings.
+ */
+static int cpu_callback(struct notifier_block *nfb, unsigned long action,
+ void *hcpu)
+{
+ int nid;
+
+ if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ const struct cpumask *mask;
+
+ mask = cpumask_of_node(pgdat->node_id);
+
+ if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
+ /* One of our CPUs online: restore mask */
+ set_cpus_allowed_ptr(pgdat->kcompactd, mask);
+ }
+ }
+ return NOTIFY_OK;
+}
+
+static int __init kcompactd_init(void)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY)
+ kcompactd_run(nid);
+ hotcpu_notifier(cpu_callback, 0);
+ return 0;
+}
+
+module_init(kcompactd_init)
+
#endif /* CONFIG_COMPACTION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9e88f74..3412aa4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -32,6 +32,7 @@
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/bootmem.h>
+#include <linux/compaction.h>

#include <asm/tlbflush.h>

@@ -1000,7 +1001,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);

- nid = pfn_to_nid(pfn);
+ nid = zone_to_nid(zone);

ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
@@ -1040,7 +1041,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
pgdat_resize_unlock(zone->zone_pgdat, &flags);

if (onlined_pages) {
- node_states_set_node(zone_to_nid(zone), &arg);
+ node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL, NULL);
else
@@ -1051,8 +1052,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ

init_per_zone_wmark_min();

- if (onlined_pages)
- kswapd_run(zone_to_nid(zone));
+ if (onlined_pages) {
+ kswapd_run(nid);
+ kcompactd_run(nid);
+ }

vm_total_pages = nr_free_pagecache_pages();

@@ -1782,8 +1785,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
zone_pcp_update(zone);

node_states_clear_node(node, &arg);
- if (arg.status_change_nid >= 0)
+ if (arg.status_change_nid >= 0) {
kswapd_stop(node);
+ kcompactd_stop(node);
+ }

vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..d9cd834 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4910,6 +4910,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+#ifdef CONFIG_COMPACTION
+ init_waitqueue_head(&pgdat->kcompactd_wait);
+#endif
pgdat_page_ext_init(pgdat);

for (j = 0; j < MAX_NR_ZONES; j++) {
--
2.4.3

2015-07-02 08:47:44

by Vlastimil Babka

Subject: [RFC 2/4] mm, thp: stop preallocating hugepages in khugepaged

Khugepaged tries to preallocate a hugepage before scanning for THP collapse
candidates. If the preallocation fails, scanning is not attempted. This makes
sense, but it is restricted to !NUMA configurations, where there is no need to
predict on which node to preallocate.

Besides the !NUMA restriction, the preallocated page may also end up being
unused and put back when no collapse candidate is found. I have observed the
thp_collapse_alloc vmstat counter to have 3+ times the value of the counter
of actually collapsed pages in /sys/.../khugepaged/pages_collapsed. On the
other hand, the periodic hugepage allocation attempts involving sync
compaction can be beneficial to the anti-fragmentation mechanism, but that is
harder to evaluate.

The following patch will introduce per-node THP availability tracking, which
has more benefits than current preallocation and is applicable to CONFIG_NUMA.
We can therefore remove the preallocation, which also allows a cleanup of the
functions involved in khugepaged allocations. Another small benefit of the
patch is that NUMA configs can now reuse an allocated hugepage for another
collapse attempt, if the previous one was for the same node and failed.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/huge_memory.c | 142 +++++++++++++++++--------------------------------------
1 file changed, 44 insertions(+), 98 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 078832c..6d83d05 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,9 +765,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
return 0;
}

-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+static inline gfp_t alloc_hugepage_gfpmask(int defrag)
{
- return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+ return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT));
}

/* Caller must hold page table lock. */
@@ -825,7 +825,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
return 0;
}
- gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+ gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
@@ -1116,7 +1116,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
- huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+ huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
} else
new_page = NULL;
@@ -2317,40 +2317,32 @@ static int khugepaged_find_target_node(void)
last_khugepaged_target_node = target_node;
return target_node;
}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+#else
+static int khugepaged_find_target_node(void)
{
- if (IS_ERR(*hpage)) {
- if (!*wait)
- return false;
-
- *wait = false;
- *hpage = NULL;
- khugepaged_alloc_sleep();
- } else if (*hpage) {
- put_page(*hpage);
- *hpage = NULL;
- }
-
- return true;
+ return 0;
}
+#endif

-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- int node)
+static struct page
+*khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
- VM_BUG_ON_PAGE(*hpage, *hpage);
-
/*
- * Before allocating the hugepage, release the mmap_sem read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_sem during
- * that. We will recheck the vma after taking it again in write mode.
+ * If we allocated a hugepage previously and failed to collapse, reuse
+ * the page, unless it's on different NUMA node.
*/
- up_read(&mm->mmap_sem);
+ if (!IS_ERR_OR_NULL(*hpage)) {
+ if (IS_ENABLED(CONFIG_NUMA) && page_to_nid(*hpage) != node) {
+ put_page(*hpage);
+ *hpage = NULL;
+ } else {
+ return *hpage;
+ }
+ }

+ gfp |= __GFP_THISNODE | __GFP_OTHER_NODE;
*hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
+
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
@@ -2360,60 +2352,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
count_vm_event(THP_COLLAPSE_ALLOC);
return *hpage;
}
-#else
-static int khugepaged_find_target_node(void)
-{
- return 0;
-}
-
-static inline struct page *alloc_hugepage(int defrag)
-{
- return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
- HPAGE_PMD_ORDER);
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
- struct page *hpage;
-
- do {
- hpage = alloc_hugepage(khugepaged_defrag());
- if (!hpage) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- if (!*wait)
- return NULL;
-
- *wait = false;
- khugepaged_alloc_sleep();
- } else
- count_vm_event(THP_COLLAPSE_ALLOC);
- } while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
- return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- if (!*hpage)
- *hpage = khugepaged_alloc_hugepage(wait);
-
- if (unlikely(!*hpage))
- return false;
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- int node)
-{
- up_read(&mm->mmap_sem);
- VM_BUG_ON(!*hpage);
-
- return *hpage;
-}
-#endif

static bool hugepage_vma_check(struct vm_area_struct *vma)
{
@@ -2449,17 +2387,25 @@ static void collapse_huge_page(struct mm_struct *mm,

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

- /* Only allocate from the target node */
- gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) |
- __GFP_THISNODE;
+ /*
+ * Determine the flags relevant for both hugepage allocation and memcg
+ * charge. Hugepage allocation may still add __GFP_THISNODE and
+ * __GFP_OTHER_NODE, which memcg ignores.
+ */
+ gfp = alloc_hugepage_gfpmask(khugepaged_defrag());

- /* release the mmap_sem read lock. */
- new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node);
+ /*
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ up_read(&mm->mmap_sem);
+ new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page)
return;

- if (unlikely(mem_cgroup_try_charge(new_page, mm,
- gfp, &memcg)))
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg)))
return;

/*
@@ -2788,15 +2734,9 @@ static void khugepaged_do_scan(void)
{
struct page *hpage = NULL;
unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
- bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
+ unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);

while (progress < pages) {
- if (!khugepaged_prealloc_page(&hpage, &wait))
- break;
-
cond_resched();

if (unlikely(kthread_should_stop() || freezing(current)))
@@ -2812,6 +2752,12 @@ static void khugepaged_do_scan(void)
else
progress = pages;
spin_unlock(&khugepaged_mm_lock);
+
+ /* THP allocation has failed during collapse */
+ if (IS_ERR(hpage)) {
+ khugepaged_alloc_sleep();
+ break;
+ }
}

if (!IS_ERR_OR_NULL(hpage))
--
2.4.3

2015-07-02 08:48:40

by Vlastimil Babka

Subject: [RFC 3/4] mm, thp: check for hugepage availability in khugepaged

Khugepaged could be scanning for collapse candidates uselessly, if it cannot
allocate a hugepage for the actual collapse event. The hugepage preallocation
mechanism has prevented this, but only for !NUMA configurations. It was
removed by the previous patch, and this patch replaces it with a more generic
mechanism.

The patch introduces a thp_avail_nodes nodemask, which initially assumes that
a hugepage can be allocated on any node. Whenever khugepaged fails to allocate
a hugepage, it clears the corresponding node bit. Before scanning for collapse
candidates, it checks the availability on all nodes and wakes up kcompactd
on nodes that have their bit cleared. Kcompactd sets the bit back in case of
a successful compaction.

During the scanning, khugepaged avoids collapsing on nodes with the bit
cleared. If no node has hugepages available, collapse scanning is skipped
altogether.
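The protocol can be modeled outside the kernel with a plain bitmask standing
in for nodemask_t. All names below are invented for this sketch; check_nodes()
mirrors the khugepaged_check_nodes() logic in the diff, with wake_requested
standing in for wakeup_kcompactd():

```c
#include <assert.h>

static unsigned long thp_avail;		/* bit set = THP assumed available */
static unsigned long wake_requested;	/* nodes whose kcompactd we "woke" */

/* Allocation failure clears the node's availability bit. */
static void thp_alloc_failed(int nid)	 { thp_avail &= ~(1UL << nid); }
/* Successful kcompactd pass sets it back. */
static void kcompactd_succeeded(int nid) { thp_avail |=  (1UL << nid); }

/*
 * Mirrors khugepaged_check_nodes(): wake kcompactd for unavailable nodes,
 * and report whether any node is still worth scanning.
 */
static int check_nodes(int nr_nodes)
{
	int nid, any = 0;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (thp_avail & (1UL << nid))
			any = 1;
		else
			wake_requested |= 1UL << nid;
	}
	return any;
}
```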

During testing, the patch did not show much difference in preventing
thp_collapse_failed events from khugepaged, but this can be attributed to the
sync compaction, which only khugepaged is allowed to use, and which nowadays
is heavyweight enough to succeed most of the time. The next patch will
however extend the nodemask check to page fault context, where it has much
larger impact. With the possible future plan to convert THP collapsing to
task_work context, this patch also prepares for avoiding useless scanning or
heavyweight THP allocations in that context.

Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/compaction.h | 2 ++
include/linux/mmzone.h | 4 ++++
mm/compaction.c | 38 +++++++++++++++++++++++++++--
mm/huge_memory.c | 60 ++++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 39 ++++++++++++++++++++++++++++++
mm/vmscan.c | 7 ++++++
6 files changed, 141 insertions(+), 9 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a2525d8..9c1cdb3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -53,6 +53,8 @@ extern bool compaction_restarting(struct zone *zone, int order);

extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
+extern bool kcompactd_work_requested(pg_data_t *pgdat);
+extern void wakeup_kcompactd(int nid, bool want_thp);

#else
static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bc96a23..4532585 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -766,6 +766,10 @@ typedef struct pglist_data {
struct task_struct *kcompactd;
wait_queue_head_t kcompactd_wait;
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ bool kcompactd_want_thp;
+#endif
+
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/compaction.c b/mm/compaction.c
index fcbc093..027a2e0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1723,8 +1723,13 @@ void compaction_unregister_node(struct node *node)
/*
* Has any special work been requested of kcompactd?
*/
-static bool kcompactd_work_requested(pg_data_t *pgdat)
+bool kcompactd_work_requested(pg_data_t *pgdat)
{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pgdat->kcompactd_want_thp)
+ return true;
+#endif
+
return false;
}

@@ -1738,6 +1743,13 @@ static void kcompactd_do_work(pg_data_t *pgdat)
* With no special task, compact all zones so that a pageblock-order
* page is allocatable. Wake up kswapd if there's not enough free
* memory for compaction.
+ *
+ * //TODO: with thp requested, just do the same thing as usual. We
+ * could try really allocating a hugepage, but that would be
+ * reclaim+compaction. If kswapd reclaim and kcompactd compaction
+ * cannot yield a hugepage, it probably means the system is busy
+ * enough with allocation/reclaim and being aggressive about THP
+ * would be of little benefit?
*/
int zoneid;
struct zone *zone;
@@ -1747,6 +1759,15 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.ignore_skip_hint = true,
};

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * Clear the flag regardless of success. If somebody still wants a
+ * hugepage, they will set it again.
+ */
+ if (pgdat->kcompactd_want_thp)
+ pgdat->kcompactd_want_thp = false;
+#endif
+
for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {

int suitable;
@@ -1778,13 +1799,26 @@ static void kcompactd_do_work(pg_data_t *pgdat)
compact_zone(zone, &cc);

if (zone_watermark_ok(zone, cc.order,
- low_wmark_pages(zone), 0, 0))
+ low_wmark_pages(zone), 0, 0)) {
compaction_defer_reset(zone, cc.order, false);
+ thp_avail_set(pgdat->node_id);
+ }

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
}
+}
+
+void wakeup_kcompactd(int nid, bool want_thp)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (want_thp)
+ pgdat->kcompactd_want_thp = true;
+#endif

+ wake_up_interruptible(&pgdat->kcompactd_wait);
}

/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6d83d05..885cb4e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -22,6 +22,7 @@
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/migrate.h>
+#include <linux/compaction.h>
#include <linux/hashtable.h>

#include <asm/tlb.h>
@@ -103,6 +104,7 @@ static struct khugepaged_scan khugepaged_scan = {
.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
};

+nodemask_t thp_avail_nodes = NODE_MASK_ALL;

static int set_recommended_min_free_kbytes(void)
{
@@ -2273,6 +2275,14 @@ static bool khugepaged_scan_abort(int nid)
int i;

/*
+ * If it's clear that we are going to select a node where THP
+ * allocation is unlikely to succeed, abort
+ */
+ if (khugepaged_node_load[nid] == (HPAGE_PMD_NR / 2) &&
+ !node_isset(nid, thp_avail_nodes))
+ return true;
+
+ /*
* If zone_reclaim_mode is disabled, then no extra effort is made to
* allocate memory locally.
*/
@@ -2346,6 +2356,7 @@ static struct page
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
+ node_clear(node, thp_avail_nodes);
return NULL;
}

@@ -2353,6 +2364,31 @@ static struct page
return *hpage;
}

+/*
+ * Return true, if THP should be allocatable on at least one node.
+ * Wake up kcompactd for nodes where THP is not available.
+ */
+static bool khugepaged_check_nodes(void)
+{
+ bool ret = false;
+ int nid;
+
+ for_each_online_node(nid) {
+ if (node_isset(nid, thp_avail_nodes)) {
+ ret = true;
+ continue;
+ }
+
+ /*
+ * Tell kcompactd we want a hugepage available. It will
+ * set the thp_avail_nodes when successful.
+ */
+ wakeup_kcompactd(nid, true);
+ }
+
+ return ret;
+}
+
static bool hugepage_vma_check(struct vm_area_struct *vma)
{
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
@@ -2580,6 +2616,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (ret) {
node = khugepaged_find_target_node();
+ if (!node_isset(node, thp_avail_nodes)) {
+ ret = 0;
+ goto out;
+ }
/* collapse_huge_page will return with the mmap_sem released */
collapse_huge_page(mm, address, hpage, vma, node);
}
@@ -2730,12 +2770,16 @@ static int khugepaged_wait_event(void)
kthread_should_stop();
}

-static void khugepaged_do_scan(void)
+/* Return false if THP allocation failed, true otherwise */
+static bool khugepaged_do_scan(void)
{
struct page *hpage = NULL;
unsigned int progress = 0, pass_through_head = 0;
unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);

+ if (!khugepaged_check_nodes())
+ return false;
+
while (progress < pages) {
cond_resched();

@@ -2754,14 +2798,14 @@ static void khugepaged_do_scan(void)
spin_unlock(&khugepaged_mm_lock);

/* THP allocation has failed during collapse */
- if (IS_ERR(hpage)) {
- khugepaged_alloc_sleep();
- break;
- }
+ if (IS_ERR(hpage))
+ return false;
}

if (!IS_ERR_OR_NULL(hpage))
put_page(hpage);
+
+ return true;
}

static void khugepaged_wait_work(void)
@@ -2790,8 +2834,10 @@ static int khugepaged(void *none)
set_user_nice(current, MAX_NICE);

while (!kthread_should_stop()) {
- khugepaged_do_scan();
- khugepaged_wait_work();
+ if (khugepaged_do_scan())
+ khugepaged_wait_work();
+ else
+ khugepaged_alloc_sleep();
}

spin_lock(&khugepaged_mm_lock);
diff --git a/mm/internal.h b/mm/internal.h
index a25e359..6d9a711 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -162,6 +162,45 @@ extern bool is_free_buddy_page(struct page *page);
#endif
extern int user_min_free_kbytes;

+/*
+ * in mm/huge_memory.c
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+extern nodemask_t thp_avail_nodes;
+
+static inline bool thp_avail_isset(int nid)
+{
+ return node_isset(nid, thp_avail_nodes);
+}
+
+static inline void thp_avail_set(int nid)
+{
+ node_set(nid, thp_avail_nodes);
+}
+
+static inline void thp_avail_clear(int nid)
+{
+ node_clear(nid, thp_avail_nodes);
+}
+
+#else
+
+static inline bool thp_avail_isset(int nid)
+{
+ return true;
+}
+
+static inline void thp_avail_set(int nid)
+{
+}
+
+static inline void thp_avail_clear(int nid)
+{
+}
+
+#endif
+
#if defined CONFIG_COMPACTION || defined CONFIG_CMA

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..d91e4d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3322,6 +3322,13 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
*/
reset_isolation_suitable(pgdat);

+ /*
+ * If kcompactd has work to do, it's possible that it was
+ * waiting for kswapd to reclaim enough memory first.
+ */
+ if (kcompactd_work_requested(pgdat))
+ wakeup_kcompactd(pgdat->node_id, false);
+
if (!kthread_should_stop())
schedule();

--
2.4.3

2015-07-02 08:48:31

by Vlastimil Babka

Subject: [RFC 4/4] mm, thp: check hugepage availability for fault allocations

Since we track hugepage availability for khugepaged THP collapses, we can also
use it for page fault THP allocations. If hugepages are considered unavailable
on a node, the cost of reclaim/compaction during the page fault could easily be
higher than any THP benefits (if the allocation succeeds at all), so we had
better fall back to base pages instead.

We clear the THP availability flag for a node if we do attempt, and fail, to
allocate a hugepage during the page fault. Kcompactd is woken up both when the
attempt is skipped due to assumed unavailability, and when the allocation truly
fails. This is to fully translate the need for hugepages into kcompactd
activity.
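A condensed model of this fault-path policy, with the actual allocation and
the kcompactd wakeup reduced to flags (all names here are invented for
illustration; try-allocation stands in for alloc_pages_exact_node()):

```c
#include <assert.h>

struct node_state {
	int avail;	/* models thp_avail_isset(nid) */
	int woken;	/* models wakeup_kcompactd(nid, true) having run */
};

/*
 * Returns 1 if a hugepage was "allocated", 0 if we fall back to base
 * pages. Both a skipped attempt and a failed attempt wake kcompactd;
 * clearing an already-clear availability bit is a no-op.
 */
static int thp_fault_alloc(struct node_state *ns, int alloc_would_succeed)
{
	int got = 0;

	if (ns->avail)
		got = alloc_would_succeed;	/* attempt the allocation */
	if (!got) {
		ns->avail = 0;			/* thp_avail_clear(nid) */
		ns->woken = 1;			/* ask kcompactd for help */
	}
	return got;
}
```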

With this patch we also set the availability flag if we are freeing a large
enough page from any context, to prevent false negatives after e.g. a process
quits and another immediately starts. This does not consider high-order pages
created by buddy merging, as that would need modifying the fast path and is
unlikely to make much difference.

Note that in case of false-positive hugepage availability, the allocation
attempt may still result in a limited direct compaction, depending on
/sys/kernel/mm/transparent_hugepage/defrag which defaults to "always". The
default could later be changed to e.g. "madvise" to eliminate all of the page
fault latency related to THP and rely exclusively on kcompactd.

Also restructure alloc_pages_vma() a bit.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mempolicy.c | 42 +++++++++++++++++++++++++++---------------
mm/page_alloc.c | 3 +++
2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..502e173 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -94,6 +94,7 @@
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
+#include <linux/compaction.h>

#include <asm/tlbflush.h>
#include <asm/uaccess.h>
@@ -1963,17 +1964,34 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, int node, bool hugepage)
{
struct mempolicy *pol;
- struct page *page;
+ struct page *page = NULL;
unsigned int cpuset_mems_cookie;
struct zonelist *zl;
nodemask_t *nmask;

+ /* Help compiler eliminate code */
+ hugepage = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage;
+
retry_cpuset:
pol = get_vma_policy(vma, addr);
cpuset_mems_cookie = read_mems_allowed_begin();

- if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage &&
- pol->mode != MPOL_INTERLEAVE)) {
+ if (pol->mode == MPOL_INTERLEAVE) {
+ unsigned nid;
+
+ nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+ mpol_cond_put(pol);
+ if (!hugepage || thp_avail_isset(nid))
+ page = alloc_page_interleave(gfp, order, nid);
+ if (hugepage && !page) {
+ thp_avail_clear(nid);
+ wakeup_kcompactd(nid, true);
+ }
+ goto out;
+ }
+
+ nmask = policy_nodemask(gfp, pol);
+ if (hugepage) {
/*
* For hugepage allocation and non-interleave policy which
* allows the current node, we only try to allocate from the
@@ -1983,25 +2001,19 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* If the policy is interleave, or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(node, *nmask)) {
mpol_cond_put(pol);
- page = alloc_pages_exact_node(node,
+ if (thp_avail_isset(node))
+ page = alloc_pages_exact_node(node,
gfp | __GFP_THISNODE, order);
+ if (!page) {
+ thp_avail_clear(node);
+ wakeup_kcompactd(node, true);
+ }
goto out;
}
}

- if (pol->mode == MPOL_INTERLEAVE) {
- unsigned nid;
-
- nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
- page = alloc_page_interleave(gfp, order, nid);
- goto out;
- }
-
- nmask = policy_nodemask(gfp, pol);
zl = policy_zonelist(gfp, pol, node);
mpol_cond_put(pol);
page = __alloc_pages_nodemask(gfp, order, zl, nmask);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9cd834..ccd87b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -830,6 +830,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
set_freepage_migratetype(page, migratetype);
free_one_page(page_zone(page), page, pfn, order, migratetype);
local_irq_restore(flags);
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+ && order >= HPAGE_PMD_ORDER)
+ thp_avail_set(page_to_nid(page));
}

void __init __free_pages_bootmem(struct page *page, unsigned int order)
--
2.4.3

2015-07-09 21:53:38

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, 2 Jul 2015, Vlastimil Babka wrote:

> Memory compaction can be currently performed in several contexts:
>
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP page
> fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
>
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate.
>
> The current situation wrt the purposes has a few drawbacks:
>
> - compaction is invoked only when a high-order page or hugepage is not
> available (or manually). This might be too late for the purposes of keeping
> memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
> better if compaction was performed asynchronously to keep fragmentation low,
> before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
> faults can easily offset the benefits of THP.
>
> To improve the situation, we need an equivalent of kswapd, but for compaction.
> E.g. a background thread which responds to fragmentation and the need for
> high-order allocations (including hugepages) somewhat proactively.
>
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
>
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
>
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations. The lifecycle mimics kswapd kthreads.
>
> The work loop of kcompactd currently mimics a pageblock-order direct
> compaction attempt every 15 seconds. This might not be enough to keep
> fragmentation low, and needs evaluation.
>
> When there's not enough free memory for compaction, kswapd is woken up for
> reclaim only (not compaction/reclaim).
>
> Further patches will add the ability to wake up kcompactd on demand in special
> situations such as when hugepages are not available, or when a fragmentation
> event occurred.
>

Thanks for looking at this again.

The code is certainly clean and the responsibilities vs kswapd and
khugepaged are clearly defined, but I'm not sure how receptive others
would be of another per-node kthread.

Khugepaged benefits from the periodic memory compaction being done
immediately before it attempts to compact memory, and that may be lost
with a de-coupled approach like this.

Initially, I suggested implementing this inside khugepaged for that
purpose, and the full compaction could be done on the next
scan_sleep_millisecs wakeup before allocating a hugepage and when
kcompactd_sleep_millisecs would have expired. So the true period between
memory compaction events could actually be
kcompactd_sleep_millisecs - scan_sleep_millisecs.
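As a toy user-space model of that timing (the tunable names are borrowed from
the discussion; the loop itself is hypothetical, not code from any patch):
khugepaged wakes every scan_sleep_millisecs and would run a full compaction
pass only on the wakeup at which the compaction period has elapsed, so
compaction immediately precedes a scan:

```c
#include <assert.h>

/* Count how many full compaction passes happen over total_ms when the
 * khugepaged loop ticks every scan_sleep_ms and compacts only once
 * compact_period_ms has elapsed since the previous pass. */
static int simulate_compaction_passes(int scan_sleep_ms, int compact_period_ms,
                                      int total_ms)
{
	int passes = 0, since_last = 0;

	for (int now = scan_sleep_ms; now <= total_ms; now += scan_sleep_ms) {
		since_last += scan_sleep_ms;
		if (since_last >= compact_period_ms) {
			passes++;          /* compact, then scan right after */
			since_last = 0;
		}
	}
	return passes;
}
```

For example, with a 10s scan sleep and a 15-minute compaction period, an hour
of simulated time yields four passes, each landing just before a scan rather
than up to alloc_sleep_millisecs before an allocation.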

You bring up an interesting point, though, about non-hugepage uses of
memory compaction and its effect on keeping fragmentation low. I'm not
sure of any reports of that actually being an issue in the wild?

I know that the networking layer has done work recently to reduce page
allocator latency for high-order allocations that can easily fallback to
order-0 memory: see commit fb05e7a89f50 ("net: don't wait for order-3 page
allocation").

The slub allocator does try to allocate its high-order memory with
__GFP_WAIT before falling back to lower orders if possible. I would think
that this would be the greatest sign of on-demand memory compaction being
a problem, especially since CONFIG_SLUB is the default, but I haven't seen
such reports.

So I'm inclined to think that the current trouble spot for memory
compaction is thp allocations. I may live to find differently :)

How would you feel about implementing this as part of the khugepaged loop
before allocating a hugepage and scanning memory?

2015-07-21 09:04:05

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On 07/09/2015 11:53 PM, David Rientjes wrote:
> On Thu, 2 Jul 2015, Vlastimil Babka wrote:
>
>> Memory compaction can be currently performed in several contexts:
>>
>> - kswapd balancing a zone after a high-order allocation failure
>> - direct compaction to satisfy a high-order allocation, including THP page
>> fault attempts
>> - khugepaged trying to collapse a hugepage
>> - manually from /proc
>>
>> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
>> (pending or future) high-order allocation, and is easy to evaluate. The other
>> purpose is to keep overall memory fragmentation low and help the
>> anti-fragmentation mechanism. The success wrt the latter purpose is more
>> difficult to evaluate.
>>
>> The current situation wrt the purposes has a few drawbacks:
>>
>> - compaction is invoked only when a high-order page or hugepage is not
>> available (or manually). This might be too late for the purposes of keeping
>> memory fragmentation low.
>> - direct compaction increases latency of allocations. Again, it would be
>> better if compaction was performed asynchronously to keep fragmentation low,
>> before the allocation itself comes.
>> - (a special case of the previous) the cost of compaction during THP page
>> faults can easily offset the benefits of THP.
>>
>> To improve the situation, we need an equivalent of kswapd, but for compaction.
>> E.g. a background thread which responds to fragmentation and the need for
>> high-order allocations (including hugepages) somewhat proactively.
>>
>> One possibility is to extend the responsibilities of kswapd, which could
>> however complicate its design too much. It should be better to let kswapd
>> handle reclaim, as order-0 allocations are often more critical than high-order
>> ones.
>>
>> Another possibility is to extend khugepaged, but this kthread is a single
>> instance and tied to THP configs.
>>
>> This patch goes with the option of a new set of per-node kthreads called
>> kcompactd, and lays the foundations. The lifecycle mimics kswapd kthreads.
>>
>> The work loop of kcompactd currently mimics a pageblock-order direct
>> compaction attempt every 15 seconds. This might not be enough to keep
>> fragmentation low, and needs evaluation.
>>
>> When there's not enough free memory for compaction, kswapd is woken up for
>> reclaim only (not compaction/reclaim).
>>
>> Further patches will add the ability to wake up kcompactd on demand in special
>> situations such as when hugepages are not available, or when a fragmentation
>> event occurred.
>>
>
> Thanks for looking at this again.
>
> The code is certainly clean and the responsibilities vs kswapd and
> khugepaged are clearly defined, but I'm not sure how receptive others
> would be of another per-node kthread.

We'll hopefully see...

> Khugepaged benefits from the periodic memory compaction being done
> immediately before it attempts to compact memory, and that may be lost
> with a de-coupled approach like this.

That could be helped with waking up khugepaged after kcompactd is
successful in making a hugepage available. Also, in your rfc you propose
the compaction period to be 15 minutes, while khugepaged wakes up every
10 (or 30) seconds by default for the scanning and collapsing, so only a
fraction of the work is attempted right after the compaction anyway?

> Initially, I suggested implementing this inside khugepaged for that
> purpose, and the full compaction could be done on the next
> scan_sleep_millisecs wakeup before allocating a hugepage and when
> kcompactd_sleep_millisecs would have expired. So the true period between
> memory compaction events could actually be
> kcompactd_sleep_millisecs - scan_sleep_millisecs.
>
> You bring up an interesting point, though, about non-hugepage uses of
> memory compaction and its effect on keeping fragmentation low. I'm not
> sure of any reports of that actually being an issue in the wild?

Hm reports of even not-so-high-order allocation failures occur from time
to time. Some might be from atomic context, but some are because
compaction just can't help due to the unmovable fragmentation. That's
mostly a guess, since such detailed information isn't there, but I think
Joonsoo did some experiments that confirmed this.

Also effects on the fragmentation are evaluated when making changes to
compaction, see e.g. http://marc.info/?l=linux-mm&m=143634369227134&w=2
In the past it has prevented changes that would improve latency of
direct compaction. They might be possible if there was a reliable source
of more thorough periodic compaction to counter the not-so-thorough
direct compaction.

> I know that the networking layer has done work recently to reduce page
> allocator latency for high-order allocations that can easily fallback to
> order-0 memory: see commit fb05e7a89f50 ("net: don't wait for order-3 page
> allocation").

Yep.

> The slub allocator does try to allocate its high-order memory with
> __GFP_WAIT before falling back to lower orders if possible. I would think
> that this would be the greatest sign of on-demand memory compaction being
> a problem, especially since CONFIG_SLUB is the default, but I haven't seen
> such reports.

Hm it's true I don't remember such report in the slub context.

> So I'm inclined to think that the current trouble spot for memory
> compaction is thp allocations. I may live to find differently :)

Yeah it's the most troublesome one, but I wouldn't discount the others.

> How would you feel about implementing this as part of the khugepaged loop
> before allocating a hugepage and scanning memory?

Yeah that's what the previous version did:

http://thread.gmane.org/gmane.linux.kernel.mm/132522

But I found it increasingly clumsy and something that should not depend
on CONFIG_THP only.

2015-07-21 23:07:07

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Tue, 21 Jul 2015, Vlastimil Babka wrote:

> > Khugepaged benefits from the periodic memory compaction being done
> > immediately before it attempts to compact memory, and that may be lost
> > with a de-coupled approach like this.
>

Meant to say "before it attempts to allocate a hugepage", but it seems you
understood that :)

> That could be helped with waking up khugepaged after kcompactd is successful
> in making a hugepage available.

I don't think the criteria for waking up khugepaged should become any more
complex beyond its current state, which is impacted by two different
tunables, and whether it actually has memory to scan. During this
additional wakeup, you'd also need to pass kcompactd's node and only do
local khugepaged scanning since there's no guarantee khugepaged can
allocate on all nodes when one kcompactd defragments memory. I think
coupling these two would be too complex and not worth it.

> Also in your rfc you propose the compaction
> period to be 15 minutes, while khugepaged wakes up every 10 (or 30) seconds by
> default for the scanning and collapsing, so only a fraction of the work is
> attempted right after the compaction anyway?
>

The rfc actually proposes the compaction period to be 0, meaning it's
disabled, but suggests in the changelog that we have seen a reproducible
benefit with the period of 15m.

I'm not concerned about scan_sleep_millisecs here, if khugepaged was able
to successfully allocate in its last scan. I'm only concerned with
alloc_sleep_millisecs which defaults to 60000. I think it would be
unfortunate if kcompactd were to free a pageblock, and then khugepaged
waits for 60s before allocating.

> Hm reports of even not-so-high-order allocation failures occur from time to
> time. Some might be from atomic context, but some are because compaction just
> can't help due to the unmovable fragmentation. That's mostly a guess, since
> such detailed information isn't there, but I think Joonsoo did some
> experiments that confirmed this.
>

If it's unmovable fragmentation, then any periodic synchronous memory
compaction isn't going to help either. The page allocator already does
MIGRATE_SYNC_LIGHT compaction on its second pass and that will terminate
when a high-order page is available. If it is currently failing, then I
don't see how synchronous memory compaction over all memory would
substantially help this case.

> Also effects on the fragmentation are evaluated when making changes to
> compaction, see e.g. http://marc.info/?l=linux-mm&m=143634369227134&w=2
> In the past it has prevented changes that would improve latency of direct
> compaction. They might be possible if there was a reliable source of more
> thorough periodic compaction to counter the not-so-thorough direct compaction.
>

Hmm, I don't think we have to select one to the exclusion of the other. I
don't think that because khugepaged may do periodic synchronous memory
compaction (to eventually remove direct compaction entirely from the page
fault path, since we have checks in the page allocator that specifically
do that) that we can't do background memory compaction elsewhere. I think
it would be trivial to schedule a workqueue in the page allocator when
MIGRATE_ASYNC compaction fails for a high-order allocation on a node and
to have that local compaction done in the background.

2015-07-22 15:23:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On 07/22/2015 01:07 AM, David Rientjes wrote:
> On Tue, 21 Jul 2015, Vlastimil Babka wrote:
>
>>> Khugepaged benefits from the periodic memory compaction being done
>>> immediately before it attempts to compact memory, and that may be lost
>>> with a de-coupled approach like this.
>>
>
> Meant to say "before it attempts to allocate a hugepage", but it seems you
> understood that :)

Right :)

>> That could be helped with waking up khugepaged after kcompactd is successful
>> in making a hugepage available.
>
> I don't think the criteria for waking up khugepaged should become any more
> complex beyond its current state, which is impacted by two different
> tunables, and whether it actually has memory to scan. During this
> additional wakeup, you'd also need to pass kcompactd's node and only do
> local khugepaged scanning since there's no guarantee khugepaged can
> allocate on all nodes when one kcompactd defragments memory.

Keeping track of the nodes where hugepage allocations are expected to
succeed is already done in this series. "local khugepaged scanning" is
unfortunately not possible in general, since the node that will be used
for a given pmd is not known until half of pte's (or more) are scanned.

> I think
> coupling these two would be too complex and not worth it.

It wouldn't be that complex (see above), and would go away if khugepaged
scanning is converted to deferred task work. In that case it's also
possible to assume that it's only worth touching memory local to the
task, so if that node indicates no available hugepages, the scanning can
be skipped.

>> Also in your rfc you propose the compaction
>> period to be 15 minutes, while khugepaged wakes up every 10 (or 30) seconds by
>> default for the scanning and collapsing, so only a fraction of the work is
>> attempted right after the compaction anyway?
>>
>
> The rfc actually proposes the compaction period to be 0, meaning it's
> disabled, but suggests in the changelog that we have seen a reproducible
> benefit with the period of 15m.

Ah, right.

> I'm not concerned about scan_sleep_millisecs here, if khugepaged was able
> to successfully allocate in its last scan. I'm only concerned with
> alloc_sleep_millisecs which defaults to 60000. I think it would be
> unfortunate if kcompactd were to free a pageblock, and then khugepaged
> waits for 60s before allocating.

Don't forget that khugepaged has to find a suitable pmd first, which can
take much longer than 60s. It might be rescanning address spaces that
have no candidates, or processes that are sleeping and wouldn't benefit
from THP. Another potential advantage for doing the scanning and
collapses in task context...

>> Hm reports of even not-so-high-order allocation failures occur from time to
>> time. Some might be from atomic context, but some are because compaction just
>> can't help due to the unmovable fragmentation. That's mostly a guess, since
>> such detailed information isn't there, but I think Joonsoo did some
>> experiments that confirmed this.
>>
>
> If it's unmovable fragmentation, then any periodic synchronous memory
> compaction isn't going to help either.

It can help if it moves movable pages out of unmovable pageblocks,
so the following unmovable allocations can be served from those
pageblocks and do not fall back and pollute another movable pageblock.
Even better if this is done (kcompactd woken up) in response to such a
fallback, where an unmovable page falls into a partially filled movable
pageblock. Stuffing this into khugepaged as well would really be a stretch.
Joonsoo proposed another daemon for that in
https://lkml.org/lkml/2015/4/27/94 but extending kcompactd would be a
very natural way for this.

> The page allocator already does
> MIGRATE_SYNC_LIGHT compaction on its second pass and that will terminate
> when a high-order page is available. If it is currently failing, then I
> don't see the benefit of synchronous memory compaction over all memory
> that would substantially help this case.

The sync compaction is no longer done for THP page faults, so if there's
no other source of sync compaction, the system can fragment over time
and then it might be too late when the need comes.

>> Also effects on the fragmentation are evaluated when making changes to
>> compaction, see e.g. http://marc.info/?l=linux-mm&m=143634369227134&w=2
>> In the past it has prevented changes that would improve latency of direct
>> compaction. They might be possible if there was a reliable source of more
>> thorough periodic compaction to counter the not-so-thorough direct compaction.
>>
>
> > Hmm, I don't think we have to select one to the exclusion of the other. I
> don't think that because khugepaged may do periodic synchronous memory
> compaction (to eventually remove direct compaction entirely from the page
> fault path, since we have checks in the page allocator that specifically
> do that)

That would be nice for the THP page faults, yes. Or maybe just change
the default for thp "defrag" tunable to "madvise".

> that we can't do background memory compaction elsewhere. I think
> it would be trivial to schedule a workqueue in the page allocator when
> MIGRATE_ASYNC compaction fails for a high-order allocation on a node and
> to have that local compaction done in the background.

I think pushing compaction into a workqueue would meet bigger resistance
than new kthreads. It could be too heavyweight for this mechanism, and
what if lots of allocations suddenly fail in parallel and all schedule
work items? So if we do it elsewhere, I think it's best done as kcompactd
kthreads, and then why would we also do it in khugepaged?

I guess a broader input than just us two would help :)

2015-07-22 22:36:59

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Wed, 22 Jul 2015, Vlastimil Babka wrote:

> > I don't think the criteria for waking up khugepaged should become any more
> > complex beyond its current state, which is impacted by two different
> > tunables, and whether it actually has memory to scan. During this
> > additional wakeup, you'd also need to pass kcompactd's node and only do
> > local khugepaged scanning since there's no guarantee khugepaged can
> > allocate on all nodes when one kcompactd defragments memory.
>
> Keeping track of the nodes where hugepage allocations are expected to succeed
> is already done in this series. "local khugepaged scanning" is unfortunately
> not possible in general, since the node that will be used for a given pmd is
> not known until half of pte's (or more) are scanned.
>

When a khugepaged allocation fails for a node, it could easily kick off
background compaction on that node and revisit the range later, very
similar to how we can kick off background compaction in the page allocator
when async or sync_light compaction fails.

The distinction I'm trying to draw is between "periodic" and "background"
compaction. I think there're usecases for both and we shouldn't be
limiting ourselves to one or the other.

Periodic compaction would wakeup at a user-defined period and fully
compact memory over all nodes, round-robin at each wakeup. This keeps
fragmentation low so that (ideally) background compaction or direct
compaction wouldn't be needed.

Background compaction would be triggered from the page allocator when
async or sync_light compaction fails, regardless of whether this was from
khugepaged, page fault context, or any other high-order allocation. This
is an interesting discussion because I can think of lots of ways to be
smart about it, but I haven't tried to implement it yet: heuristics that
do ratelimiting, preemptive compaction based on fragmentation stats, etc.
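Those ratelimiting heuristics aren't implemented anywhere in this thread; as
a purely hypothetical user-space sketch of what one could look like (struct,
names, and interval are mine, not from any patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a ratelimited background-compaction trigger: when direct
 * compaction fails on a node, request background work at most once per
 * min_interval_ms and silently drop requests arriving sooner. */
struct bg_compact_state {
	long last_wakeup_ms;   /* timestamp of the last accepted request */
	long min_interval_ms;  /* ratelimit interval */
	int  wakeups;          /* requests that actually woke the worker */
};

static bool request_bg_compaction(struct bg_compact_state *s, long now_ms)
{
	if (s->wakeups && now_ms - s->last_wakeup_ms < s->min_interval_ms)
		return false;  /* ratelimited: recent work already queued */
	s->last_wakeup_ms = now_ms;
	s->wakeups++;
	return true;
}
```

A burst of parallel allocation failures then collapses into a single wakeup
per interval instead of one work item per failure.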

My rfc implements periodic compaction in khugepaged simply because we find
very large thp_fault_fallback numbers and these faults tend to come in
bunches so that background compaction wouldn't really help the situation
itself: it's simply not fast enough and we give up compaction at fault way
too early for it to have a chance of being successful. I have a hard time
finding other examples of that outside thp, especially at such large
orders. The number one culprit that I can think of would be slub and I
haven't seen any complaints about high order_fallback stats.

The additional benefit of doing the periodic compaction in khugepaged is
that we can do it before scanning, where alloc_sleep_millisecs is so high
that kicking off background compaction on allocation failure wouldn't
help.

Then, storing the nodes where khugepaged allocation has failed isn't
needed: the allocation itself would trigger background compaction.

> > If it's unmovable fragmentation, then any periodic synchronous memory
> > compaction isn't going to help either.
>
> It can help if it moves movable pages out of unmovable pageblocks, so the
> following unmovable allocations can be served from those pageblocks and do not
> fall back and pollute another movable pageblock. Even better if this is done
> (kcompactd woken up) in response to such a fallback, where an unmovable page
> falls into a partially filled movable pageblock. Stuffing this into khugepaged
> as well would really be a stretch. Joonsoo proposed another daemon for that in
> https://lkml.org/lkml/2015/4/27/94 but extending kcompactd would be a very
> natural way for this.
>

Sure, this is an example of why background compaction would be helpful and
triggered by the page allocator when async or migrate_sync allocation
fails.

> > Hmm, I don't think we have to select one to the excusion of the other. I
> > don't think that because khugepaged may do periodic synchronous memory
> > compaction (to eventually remove direct compaction entirely from the page
> > fault path, since we have checks in the page allocator that specifically
> > do that)
>
> That would be nice for the THP page faults, yes. Or maybe just change the
> default for thp "defrag" tunable to "madvise".
>

Right, however I'm afraid that what we have done to compaction in the
fault path for MIGRATE_ASYNC has implicitly changed that default in
the code :) I have examples where async compaction in the fault path
scans three pageblocks and gives up because of the abort heuristics;
that's not suggesting that we'll be very successful. The hope is that we
can change the default to "madvise" due to periodic and background
compaction and then make the "always" case do some actual defrag :)

> I think pushing compaction into a workqueue would meet bigger resistance than
> new kthreads. It could be too heavyweight for this mechanism, and what if
> lots of allocations suddenly fail in parallel and all schedule
> work items? So if we do it elsewhere, I think it's best done as kcompactd
> kthreads, and then why would we also do it in khugepaged?
>

We'd need the aforementioned ratelimiting to ensure that background
compaction is handled appropriately, absolutely.

2015-07-23 05:59:31

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

Hello,

On Thu, Jul 09, 2015 at 02:53:27PM -0700, David Rientjes wrote:

> The slub allocator does try to allocate its high-order memory with
> __GFP_WAIT before falling back to lower orders if possible. I would think
> that this would be the greatest sign of on-demand memory compaction being
> a problem, especially since CONFIG_SLUB is the default, but I haven't seen
> such reports.

In fact, some of our products had trouble with slub's high-order
allocation 5 months ago. At that time, compaction couldn't make high-order
pages and compaction attempts were frequently deferred. It also caused a lot
of reclaim in order to make high-order pages, so I suggested masking out
__GFP_WAIT and adding __GFP_NO_KSWAPD when trying slub's high-order
allocation to reduce reclaim/compaction overhead. Although using high-order
pages in slub has some gains, reducing internal fragmentation and management
overhead, the benefit is marginal compared to the cost of making a high-order
page. This solution improved system response time for our case. I planned
to submit the patch but it has been delayed due to my laziness. :)

Thanks.

2015-07-23 09:18:58

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On 07/23/2015 12:36 AM, David Rientjes wrote:
> On Wed, 22 Jul 2015, Vlastimil Babka wrote:
>
>> > I don't think the criteria for waking up khugepaged should become any more
>> > complex beyond its current state, which is impacted by two different
>> > tunables, and whether it actually has memory to scan. During this
>> > additional wakeup, you'd also need to pass kcompactd's node and only do
>> > local khugepaged scanning since there's no guarantee khugepaged can
>> > allocate on all nodes when one kcompactd defragments memory.
>>
>> Keeping track of the nodes where hugepage allocations are expected to succeed
>> is already done in this series. "local khugepaged scanning" is unfortunately
>> not possible in general, since the node that will be used for a given pmd is
>> not known until half of pte's (or more) are scanned.
>>
>
> When a khugepaged allocation fails for a node, it could easily kick off
> background compaction on that node and revisit the range later, very
> similar to how we can kick off background compaction in the page allocator
> when async or sync_light compaction fails.

The revisiting sounds rather complicated. The page allocator doesn't have to do that.

> The distinction I'm trying to draw is between "periodic" and "background"
> compaction. I think there're usecases for both and we shouldn't be
> limiting ourselves to one or the other.

OK, I understand you think we can have both, and the periodic one would be in
khugepaged. My main concern is that if we do the periodic one in khugepaged,
people might oppose adding yet another one as kcompactd. I hope we agree that
khugepaged is not suitable for all the use cases of the background one.

My secondary concern/opinion is that I would hope that the background compaction
would be good enough to remove the need for the periodic one. So I would try the
background one first. But I understand the periodic one is simpler to implement.
On the other hand, it's not as urgent if you can simulate it from userspace.
With the 15min period you use, there's likely not much overhead saved when
invoking it from within the kernel? Sure, there wouldn't be the synchronization
with khugepaged activity, but I still wonder if waiting for up to 1 minute
before khugepaged wakes up can make much difference with the 15min period.
Hm, your cron job could also perhaps adjust the khugepaged sleep tunable when
compaction is done, which IIRC results in immediate wakeup.

> Periodic compaction would wakeup at a user-defined period and fully
> compact memory over all nodes, round-robin at each wakeup. This keeps
> fragmentation low so that (ideally) background compaction or direct
> compaction wouldn't be needed.
>
> Background compaction would be triggered from the page allocator when
> async or sync_light compaction fails, regardless of whether this was from
> khugepaged, page fault context, or any other high-order allocation. This
> is an interesting discussion because I can think of lots of ways to be
> smart about it, but I haven't tried to implement it yet: heuristics that
> do ratelimiting, preemptive compaction based on fragmentation stats, etc.

Yes.

> My rfc implements periodic compaction in khugepaged simply because we find
> very large thp_fault_fallback numbers and these faults tend to come in
> bunches so that background compaction wouldn't really help the situation
> itself: it's simply not fast enough and we give up compaction at fault way
> too early for it have a chance of being successful. I have a hard time
> finding other examples of that outside thp, especially at such large
> orders. The number one culprit that I can think of would be slub and I
> haven't seen any complaints about high order_fallback stats.
>
> The additional benefit of doing the periodic compaction in khugepaged is
> that we can do it before scanning, where alloc_sleep_millisecs is so high
> that kicking off background compaction on allocation failure wouldn't
> help.
>
> Then, storing the nodes where khugepaged allocation has failed isn't
> needed: the allocation itself would trigger background compaction.

The storing is more useful for THP page faults as it prevents further direct
reclaim and compaction attempts (potentially interfering with the background
compaction), until the triggered background compaction succeeds. The assumption
is that the attempts would likely fail anyway and just increase the page fault
latency. You could see it as a simple rate limiting too.
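The stored-nodes rate limiting described above can be sketched in userspace C
as follows. The names and the flat bitmask are illustrative only, not kernel
API; in the kernel this would be per-node state consulted on the THP fault
path:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch (not kernel code): remember nodes where a THP fault
 * allocation failed and background compaction was kicked off, and skip
 * direct reclaim/compaction there until kcompactd reports success. */
static unsigned long thp_compact_pending;	/* one bit per node */

static void thp_alloc_failed(int nid)
{
	/* record the failure; the kernel would also wake kcompactd here */
	thp_compact_pending |= 1UL << nid;
}

static void kcompactd_succeeded(int nid)
{
	/* background compaction made a hugepage; allow direct attempts again */
	thp_compact_pending &= ~(1UL << nid);
}

static bool thp_should_direct_compact(int nid)
{
	/* a direct attempt would likely fail anyway and only add fault latency */
	return !(thp_compact_pending & (1UL << nid));
}
```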

>> > If it's unmovable fragmentation, then any periodic synchronous memory
>> > compaction isn't going to help either.
>>
>> It can help if it moves away movable pages out of unmovable pageblocks, so the
>> following unmovable allocations can be served from those pageblocks and not
>> fallback to pollute another movable pageblock. Even better if this is done
>> (kcompactd woken up) in response to such fallback, where unmovable page falls
>> to a partially filled movable pageblock. Stuffing also this into khugepaged
>> would be really a stretch. Joonsoo proposed another daemon for that in
>> https://lkml.org/lkml/2015/4/27/94 but extending kcompactd would be a very
>> natural way for this.
>>
>
> Sure, this is an example of why background compaction would be helpful and
> triggered by the page allocator when async or migrate_sync allocation
> fails.
>
>> > Hmm, I don't think we have to select one to the excusion of the other. I
>> > don't think that because khugepaged may do periodic synchronous memory
>> > compaction (to eventually remove direct compaction entirely from the page
>> > fault path, since we have checks in the page allocator that specifically
>> > do that)
>>
>> That would be nice for the THP page faults, yes. Or maybe just change the
>> default for thp "defrag" tunable to "madvise".
>>
>
> Right, however I'm afraid that what we have done to compaction in the
> fault path for MIGRATE_ASYNC has implicitly changed that default in
> the code :) I have examples where async compaction in the fault path
> scans three pageblocks and gives up because of the abort heuristics,
> that's not suggesting that we'll be very successful. The hope is that we
> can change the default to "madvise" due to periodic and background
> compaction and then make the "always" case do some actual defrag :)

OK.

>> I think pushing compaction in a workqueue would meet a bigger resistance than
>> new kthreads. It could be too heavyweight for this mechanism and what if
>> there's suddenly lots of allocations in parallel failing and scheduling the
>> work items? So if we do it elsewhere, I think it's best as kcompactd kthreads
>> and then why would we do it also in khugepaged?
>>
>
> We'd need the aforementioned ratelimiting to ensure that background
> compaction is handled appropriately, absolutely.

So we would limit the number of work items, but a single work item could still
be very heavyweight. I'm not sure it would be perceived well, and there's also
the lack of accountability. A kthread is still better for this type of work IMHO.

2015-07-23 20:58:26

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, 23 Jul 2015, Joonsoo Kim wrote:

> > The slub allocator does try to allocate its high-order memory with
> > __GFP_WAIT before falling back to lower orders if possible. I would think
> > that this would be the greatest sign of on-demand memory compaction being
> > a problem, especially since CONFIG_SLUB is the default, but I haven't seen
> > such reports.
>
> In fact, some of our products had trouble with slub's high-order
> allocation 5 months ago. At that time, compaction couldn't make high-order
> pages and compaction attempts were frequently deferred. It also caused a
> lot of reclaim in order to make high-order pages, so I suggested masking
> out __GFP_WAIT and adding __GFP_NO_KSWAPD when trying slub's high-order
> allocation to reduce reclaim/compaction overhead. Although using high-order
> pages in slub has some gains, reducing internal fragmentation and management
> overhead, the benefit is marginal compared to the cost of making a
> high-order page. This solution improved system response time in our case.
> I planned to submit the patch but it has been delayed due to my laziness. :)
>

Hi Joonsoo,

On a fragmented machine I can certainly understand that the overhead
involved in allocating the high-order page outweighs the benefit later, and
it's better to fall back more quickly to lower page orders if the cache
allows it.

I believe that this would be improved by the suggestion of doing
background synchronous compaction. So regardless of whether __GFP_WAIT is
set, if the allocation fails then we can kick off background compaction
that will hopefully defragment memory for future callers. That should
make high-order atomic allocations more successful as well.

2015-07-23 21:21:35

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, 23 Jul 2015, Vlastimil Babka wrote:

> > When a khugepaged allocation fails for a node, it could easily kick off
> > background compaction on that node and revisit the range later, very
> > similar to how we can kick off background compaction in the page allocator
> > when async or sync_light compaction fails.
>
> The revisiting sounds rather complicated. Page allocator doesn't have to do that.
>

I'm referring to khugepaged having a hugepage allocation fail, the page
allocator kicking off background compaction, and khugepaged rescanning the
same memory for which the allocation failed later.

> > The distinction I'm trying to draw is between "periodic" and "background"
> > compaction. I think there're usecases for both and we shouldn't be
> > limiting ourselves to one or the other.
>
> OK, I understand you think we can have both, and the periodic one would be in
> khugepaged. My main concern is that if we do the periodic one in khugepaged,
> people might oppose adding yet another one as kcompactd. I hope we agree that
> khugepaged is not suitable for all the use cases of the background one.
>

Yes, absolutely. I agree that we need the ability to do background
compaction without requiring CONFIG_TRANSPARENT_HUGEPAGE.

> My secondary concern/opinion is that I would hope that the background compaction
> would be good enough to remove the need for the periodic one. So I would try the
> background one first. But I understand the periodic one is simpler to implement.
> On the other hand, it's not as urgent if you can simulate it from userspace.
> With the 15min period you use, there's likely not much overhead saved when
> invoking it from within the kernel? Sure there wouldn't be the synchronization
> with khugepaged activity, but I still wonder if waiting for up to 1 minute
> before khugepaged wakes up can make much difference with the 15min period.
> Hm, your cron job could also perhaps adjust the khugepaged sleep tunable when
> compaction is done, which IIRC results in immediate wakeup.
>

There are certainly ways to do this from userspace, but the premise is
that this issue, specifically for users of thp, is significant for
everyone ;)

The problem that I've encountered with a background-only approach is that
it doesn't help when you exec a large process that wants to fault most of
its text and thp immediately cannot be allocated. This can be a result of
never having done any compaction at all other than from the page
allocator, which terminates when a page of the given order is available.
So on a fragmented machine, all memory faulted is shown in
thp_fault_fallback and we rely on khugepaged to (slowly) fix this problem
up for us. We have shown great improvement in cpu utilization by
periodically compacting memory today.

Background compaction arguably wouldn't help that situation because it's
not fast enough to compact memory simultaneous to the large number of page
faults, and you can't wait for it to complete at exec(). The result is
the same: large thp_fault_fallback.

So I can understand the need for both periodic and background compaction
(and direct compaction for non-thp non-atomic high-order allocations
today) and I'm perhaps not as convinced as you are that we can eventually
do without periodic compaction.


It seems to me that the vast majority of this discussion has centered
around the vehicle that performs the compaction. We certainly require
kcompactd for background compaction, and we both agree that we need that
functionality.

Two issues I want to bring up:

(1) do non-thp configs benefit from periodic compaction?

In my experience, no, but perhaps there are other use cases where
this has been a pain. The primary candidates, in my opinion,
would be the networking stack and slub. Joonsoo reports having to
workaround issues with high-order slub allocations being too
expensive. I'm not sure that would be better served by periodic
compaction, but it seems like a candidate for background compaction.

This is why my rfc tied periodic compaction to khugepaged, and we
have strong evidence that this helps thp and cpu utilization. For
periodic compaction to be possible outside of thp, we'd need a use
case for it.

(2) does kcompactd have to be per-node?

I don't see the immediate benefit since direct compaction can
already scan remote memory and migrate it; khugepaged can do the
same. Is there evidence that suggests that a per-node kcompactd
is significantly better than a single kthread? I think others
would be more receptive of a single kthread addition.

My theory is that periodic compaction is only significantly beneficial for
thp per my rfc, and I think there's a significant advantage for khugepaged
to be able to trigger this periodic compaction immediately before scanning
and allocating to avoid waiting potentially for the lengthy
alloc_sleep_millisecs. I don't see a problem with defining the period
with a khugepaged tunable for that reason.

For background compaction, which is more difficult, it would be simple to
implement a kcompactd to perform the memory compaction and actually be
triggered by khugepaged to do the compaction on its behalf and wait to
scan and allocate until it's complete. The vehicle will probably end up as
kcompactd doing the actual compaction in both cases.

But until we have a background compaction implementation, it seems like
there's no objection to doing and defining periodic compaction in
khugepaged as the rfc proposes? It seems like we can easily extend that
in the future once background compaction is available.

2015-07-24 05:28:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, Jul 23, 2015 at 01:58:20PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Joonsoo Kim wrote:
>
> > > The slub allocator does try to allocate its high-order memory with
> > > __GFP_WAIT before falling back to lower orders if possible. I would think
> > > that this would be the greatest sign of on-demand memory compaction being
> > > a problem, especially since CONFIG_SLUB is the default, but I haven't seen
> > > such reports.
> >
> > In fact, some of our products had trouble with slub's high-order
> > allocation 5 months ago. At that time, compaction couldn't make high-order
> > pages and compaction attempts were frequently deferred. It also caused a
> > lot of reclaim in order to make high-order pages, so I suggested masking
> > out __GFP_WAIT and adding __GFP_NO_KSWAPD when trying slub's high-order
> > allocation to reduce reclaim/compaction overhead. Although using high-order
> > pages in slub has some gains, reducing internal fragmentation and management
> > overhead, the benefit is marginal compared to the cost of making a
> > high-order page. This solution improved system response time in our case.
> > I planned to submit the patch but it has been delayed due to my laziness. :)
> >
>
> Hi Joonsoo,

Hello David.

>
> On a fragmented machine I can certainly understand that the overhead
> involved in allocating the high-order page outweighs the benefit later, and
> it's better to fall back more quickly to lower page orders if the cache
> allows it.
>
> I believe that this would be improved by the suggestion of doing
> background synchronous compaction. So regardless of whether __GFP_WAIT is
> set, if the allocation fails then we can kick off background compaction
> that will hopefully defragment memory for future callers. That should
> make high-order atomic allocations more successful as well.

Yep! I also think __GFP_NO_KSWAPD isn't appropriate for the general case.
The reason I suggested __GFP_NO_KSWAPD for our system is that reclaim/compaction
continually failed to make high-order pages, so we didn't want to invoke
reclaim/compaction even in the background. But on almost all other
systems, reclaim/compaction could succeed, so adding __GFP_NO_KSWAPD
doesn't make sense for the general case.

Thanks.

2015-07-24 06:11:37

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, Jul 23, 2015 at 02:21:29PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Vlastimil Babka wrote:
>
> > > When a khugepaged allocation fails for a node, it could easily kick off
> > > background compaction on that node and revisit the range later, very
> > > similar to how we can kick off background compaction in the page allocator
> > > when async or sync_light compaction fails.
> >
> > The revisiting sounds rather complicated. Page allocator doesn't have to do that.
> >
>
> I'm referring to khugepaged having a hugepage allocation fail, the page
> allocator kicking off background compaction, and khugepaged rescanning the
> same memory for which the allocation failed later.
>
> > > The distinction I'm trying to draw is between "periodic" and "background"
> > > compaction. I think there're usecases for both and we shouldn't be
> > > limiting ourselves to one or the other.
> >
> > OK, I understand you think we can have both, and the periodic one would be in
> > khugepaged. My main concern is that if we do the periodic one in khugepaged,
> > people might oppose adding yet another one as kcompactd. I hope we agree that
> > khugepaged is not suitable for all the use cases of the background one.
> >
>
> Yes, absolutely. I agree that we need the ability to do background
> compaction without requiring CONFIG_TRANSPARENT_HUGEPAGE.
>
> > My secondary concern/opinion is that I would hope that the background compaction
> > would be good enough to remove the need for the periodic one. So I would try the
> > background one first. But I understand the periodic one is simpler to implement.
> > On the other hand, it's not as urgent if you can simulate it from userspace.
> > With the 15min period you use, there's likely not much overhead saved when
> > invoking it from within the kernel? Sure there wouldn't be the synchronization
> > with khugepaged activity, but I still wonder if waiting for up to 1 minute
> > before khugepaged wakes up can make much difference with the 15min period.
> > Hm, your cron job could also perhaps adjust the khugepaged sleep tunable when
> > compaction is done, which IIRC results in immediate wakeup.
> >
>
> There are certainly ways to do this from userspace, but the premise is
> that this issue, specifically for users of thp, is significant for
> everyone ;)
>
> The problem that I've encountered with a background-only approach is that
> it doesn't help when you exec a large process that wants to fault most of
> its text and thp immediately cannot be allocated. This can be a result of
> never having done any compaction at all other than from the page
> allocator, which terminates when a page of the given order is available.
> So on a fragmented machine, all memory faulted is shown in
> thp_fault_fallback and we rely on khugepaged to (slowly) fix this problem
> up for us. We have shown great improvement in cpu utilization by
> periodically compacting memory today.
>
> Background compaction arguably wouldn't help that situation because it's
> not fast enough to compact memory simultaneous to the large number of page
> faults, and you can't wait for it to complete at exec(). The result is
> the same: large thp_fault_fallback.
>
> So I can understand the need for both periodic and background compaction
> (and direct compaction for non-thp non-atomic high-order allocations
> today) and I'm perhaps not as convinced as you are that we can eventually
> do without periodic compaction.
>
>
> It seems to me that the vast majority of this discussion has centered
> around the vehicle that performs the compaction. We certainly require
> kcompactd for background compaction, and we both agree that we need that
> functionality.
>
> Two issues I want to bring up:
>
> (1) do non-thp configs benefit from periodic compaction?
>
> In my experience, no, but perhaps there are other use cases where
> this has been a pain. The primary candidates, in my opinion,
> would be the networking stack and slub. Joonsoo reports having to
> workaround issues with high-order slub allocations being too
> expensive. I'm not sure that would be better served by periodic
> compaction, but it seems like a candidate for background compaction.

In the embedded world, there is another candidate, the ION allocator. When
launching a new app, it tries to allocate high-order pages for graphics memory
and falls back to lower orders in the sequence (8, 4, 0). Its success
affects system performance. It looks like a case similar to THP, so I guess
it could also benefit from periodic compaction. A detailed explanation
of the problem is in Pintu's slides, so please refer to them for
further information.

http://events.linuxfoundation.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf

I think that supporting periodic compaction for other configs is the way
to go.
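The ION-style order fallback described above can be sketched in userspace C
like this. The helper names and the "largest free order" stand-in for the
real allocator are purely illustrative, not ION's actual API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of an ION-style order fallback: try order 8, then 4,
 * then 0.  try_alloc_order() is a stand-in for the real page allocator and
 * here simply succeeds iff a large enough free block is simulated to exist. */
static const unsigned int ion_orders[] = { 8, 4, 0 };

static int try_alloc_order(unsigned int order, unsigned int largest_free_order)
{
	return order <= largest_free_order;
}

/* Returns the order actually used, or -1 if every attempt failed. */
static int ion_alloc_fallback(unsigned int largest_free_order)
{
	for (size_t i = 0; i < sizeof(ion_orders) / sizeof(ion_orders[0]); i++)
		if (try_alloc_order(ion_orders[i], largest_free_order))
			return (int)ion_orders[i];
	return -1;
}
```

On a well-compacted system the first attempt (order 8) succeeds; on a
fragmented one the allocator degrades to many order-0 pages, which is the
performance problem periodic compaction would address.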

Thanks.

>
> This is why my rfc tied periodic compaction to khugepaged, and we
> have strong evidence that this helps thp and cpu utilization. For
> periodic compaction to be possible outside of thp, we'd need a use
> case for it.
>
> (2) does kcompactd have to be per-node?
>
> I don't see the immediate benefit since direct compaction can
> already scan remote memory and migrate it, khugepaged can do the
> same. Is there evidence that suggests that a per-node kcompactd
> is significantly better than a single kthread? I think others
> would be more receptive of a single kthread addition.
>
> My theory is that periodic compaction is only significantly beneficial for
> thp per my rfc, and I think there's a significant advantage for khugepaged
> to be able to trigger this periodic compaction immediately before scanning
> and allocating to avoid waiting potentially for the lengthy
> alloc_sleep_millisecs. I don't see a problem with defining the period
> with a khugepaged tunable for that reason.
>
> For background compaction, which is more difficult, it would be simple to
> implement a kcompactd to perform the memory compaction and actually be
> triggered by khugepaged to do the compaction on its behalf and wait to
> scan and allocate until it's complete. The vehicle will probably end up as
> kcompactd doing the actual compaction in both cases.
>
> But until we have a background compaction implementation, it seems like
> there's no objection to doing and defining periodic compaction in
> khugepaged as the rfc proposes? It seems like we can easily extend that
> in the future once background compaction is available.
>

2015-07-24 06:45:43

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On 07/23/2015 11:21 PM, David Rientjes wrote:
> On Thu, 23 Jul 2015, Vlastimil Babka wrote:
>
>>> When a khugepaged allocation fails for a node, it could easily kick off
>>> background compaction on that node and revisit the range later, very
>>> similar to how we can kick off background compaction in the page allocator
>>> when async or sync_light compaction fails.
>>
>> The revisiting sounds rather complicated. Page allocator doesn't have to do that.
>>
>
> I'm referring to khugepaged having a hugepage allocation fail, the page
> allocator kicking off background compaction, and khugepaged rescanning the
> same memory for which the allocation failed later.

Ah, OK.

>>> The distinction I'm trying to draw is between "periodic" and "background"
>>> compaction. I think there're usecases for both and we shouldn't be
>>> limiting ourselves to one or the other.
>>
>> OK, I understand you think we can have both, and the periodic one would be in
>> khugepaged. My main concern is that if we do the periodic one in khugepaged,
>> people might oppose adding yet another one as kcompactd. I hope we agree that
>> khugepaged is not suitable for all the use cases of the background one.
>>
>
> Yes, absolutely. I agree that we need the ability to do background
> compaction without requiring CONFIG_TRANSPARENT_HUGEPAGE.

Great.

>> My secondary concern/opinion is that I would hope that the background compaction
>> would be good enough to remove the need for the periodic one. So I would try the
>> background one first. But I understand the periodic one is simpler to implement.
>> On the other hand, it's not as urgent if you can simulate it from userspace.
>> With the 15min period you use, there's likely not much overhead saved when
>> invoking it from within the kernel? Sure there wouldn't be the synchronization
>> with khugepaged activity, but I still wonder if waiting for up to 1 minute
>> before khugepaged wakes up can make much difference with the 15min period.
>> Hm, your cron job could also perhaps adjust the khugepaged sleep tunable when
>> compaction is done, which IIRC results in immediate wakeup.
>>
>
> There are certainly ways to do this from userspace, but the premise is
> that this issue, specifically for users of thp, is significant for
> everyone ;)

Well, then the default shouldn't be disabled :)

> The problem that I've encountered with a background-only approach is that
> it doesn't help when you exec a large process that wants to fault most of
> its text and thp immediately cannot be allocated. This can be a result of
> never having done any compaction at all other than from the page
> allocator, which terminates when a page of the given order is available.
> So on a fragmented machine, all memory faulted is shown in
> thp_fault_fallback and we rely on khugepaged to (slowly) fix this problem
> up for us. We have shown great improvement in cpu utilization by
> periodically compacting memory today.
>
> Background compaction arguably wouldn't help that situation because it's
> not fast enough to compact memory simultaneous to the large number of page
> faults, and you can't wait for it to complete at exec(). The result is
> the same: large thp_fault_fallback.

You probably presume that background compaction (e.g. kcompactd) is only
triggered by high-order allocator failures. That would be indeed too
late. But the plan is that it would be more proactive and e.g. wake up
periodically to check the fragmentation / availability of high-order pages.

Also, in the scenario of suddenly executing a large process, isn't most of
the memory not occupied by other processes already occupied by page cache
anyway, so you would have at most a high-watermark's worth of compacted
free memory? If the process is larger than that, it would still go to
direct reclaim/compaction regardless of periodic compaction...

> So I can understand the need for both periodic and background compaction
> (and direct compaction for non-thp non-atomic high-order allocations
> today) and I'm perhaps not as convinced as you are that we can eventually
> do without periodic compaction.
>
>
> It seems to me that the vast majority of this discussion has centered
> around the vehicle that performs the compaction. We certainly require
> kcompactd for background compaction, and we both agree that we need that
> functionality.
>
> Two issues I want to bring up:
>
> (1) do non-thp configs benefit from periodic compaction?
>
> In my experience, no, but perhaps there are other use cases where
> this has been a pain. The primary candidates, in my opinion,
> would be the networking stack and slub. Joonsoo reports having to
> workaround issues with high-order slub allocations being too
> expensive. I'm not sure that would be better served by periodic
> compaction, but it seems like a candidate for background compaction.

Yes hopefully a proactive background compaction would serve them enough.

> This is why my rfc tied periodic compaction to khugepaged, and we
> have strong evidence that this helps thp and cpu utilization. For
> periodic compaction to be possible outside of thp, we'd need a use
> case for it.
>
> (2) does kcompactd have to be per-node?
>
> I don't see the immediate benefit since direct compaction can
> already scan remote memory and migrate it, khugepaged can do the

It can work remotely, but it's slower.

> same. Is there evidence that suggests that a per-node kcompactd
> is significantly better than a single kthread? I think others
> would be more receptive of a single kthread addition.

I think it's a simpler design wrt waking up the kthread for the desired
node, and self-tuning any sleeping depending on per-node pressure. It
also matches the design of kswapd. And IMHO machines with many memory
nodes should naturally also have many CPUs to cope with the threads, so
it should all scale well.

>
> My theory is that periodic compaction is only significantly beneficial for
> thp per my rfc, and I think there's a significant advantage for khugepaged
> to be able to trigger this periodic compaction immediately before scanning
> and allocating to avoid waiting potentially for the lengthy
> alloc_sleep_millisecs. I don't see a problem with defining the period
> with a khugepaged tunable for that reason.
>
> For background compaction, which is more difficult, it would be simple to
> implement a kcompactd to perform the memory compaction and actually be
> triggered by khugepaged to do the compaction on its behalf and wait to
> scan and allocate until it's complete. The vehicle will probably end up as
> kcompactd doing the actual compaction in both cases.
>
> But until we have a background compaction implementation, it seems like
> there's no objection to doing and defining periodic compaction in
> khugepaged as the rfc proposes? It seems like we can easily extend that
> in the future once background compaction is available.

Besides the vehicle details which are possible to change in the future,
the larger issue IMHO is introducing new tunables and then having to
commit to them forever. So we should be careful there.

2015-07-24 14:22:27

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC v2 0/4] Outsourcing compaction for THP allocations to kcompactd

On 07/02/2015 04:46 AM, Vlastimil Babka wrote:
> This RFC series is another evolution of the attempt to deal with THP
> allocations latencies. Please see the motivation in the previous version [1]
>
> The main difference here is that I've bitten the bullet and implemented
> per-node kcompactd kthreads - see Patch 1 for the details of why and how.
> Trying to fit everything into khugepaged was getting too clumsy, and kcompactd
> could have more benefits, see e.g. the ideas here [2]. Not everything is
> implemented yet, though, I would welcome some feedback first.

This leads to a few questions, one of which has an obvious answer.

1) Why should this functionality not be folded into kswapd?

(because kswapd can get stuck on IO for long periods of time)

2) Given that kswapd can get stuck on IO for long periods of
time, are there other tasks we may want to break out of
kswapd, in order to reduce page reclaim latencies for things
like network allocations?

(freeing clean inactive pages?)

2015-07-27 09:30:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC v2 0/4] Outsourcing compaction for THP allocations to kcompactd

On 07/24/2015 04:22 PM, Rik van Riel wrote:
> On 07/02/2015 04:46 AM, Vlastimil Babka wrote:
>> This RFC series is another evolution of the attempt to deal with THP
>> allocations latencies. Please see the motivation in the previous version [1]
>>
>> The main difference here is that I've bitten the bullet and implemented
>> per-node kcompactd kthreads - see Patch 1 for the details of why and how.
>> Trying to fit everything into khugepaged was getting too clumsy, and kcompactd
>> could have more benefits, see e.g. the ideas here [2]. Not everything is
>> implemented yet, though, I would welcome some feedback first.
>
> This leads to a few questions, one of which has an obvious answer.
>
> 1) Why should this functionality not be folded into kswapd?
>
> (because kswapd can get stuck on IO for long periods of time)

Hm, my main concern was somewhat the opposite - kswapd primarily serves to
avoid direct reclaim (also) for order-0 allocations, so we don't want to
make it busy compacting for high-order allocations and then fail to
reclaim quickly enough.
Also the waking up of kswapd for all the distinct tasks would become
more complex.

Also, does kswapd really get stuck on IO? Doesn't it just issue writeback
and move on? Again, it would be the opposite concern, as sync compaction
may have to wait for writeback before migrating a page, and blocking
kswapd on that wouldn't be nice.

> 2) Given that kswapd can get stuck on IO for long periods of
> time, are there other tasks we may want to break out of
> kswapd, in order to reduce page reclaim latencies for things
> like network allocations?
>
> (freeing clean inactive pages?)
>

2015-07-29 00:33:22

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Fri, 24 Jul 2015, Vlastimil Babka wrote:

> > Two issues I want to bring up:
> >
> > (1) do non-thp configs benefit from periodic compaction?
> >
> > In my experience, no, but perhaps there are other use cases where
> > this has been a pain. The primary candidates, in my opinion,
> > would be the networking stack and slub. Joonsoo reports having to
> > workaround issues with high-order slub allocations being too
> > expensive. I'm not sure that would be better served by periodic
> > compaction, but it seems like a candidate for background compaction.
>
> Yes hopefully a proactive background compaction would serve them enough.
>
> > This is why my rfc tied periodic compaction to khugepaged, and we
> > have strong evidence that this helps thp and cpu utilization. For
> > periodic compaction to be possible outside of thp, we'd need a use
> > case for it.
> >
> > (2) does kcompactd have to be per-node?
> >
> > I don't see the immediate benefit since direct compaction can
> > already scan remote memory and migrate it, khugepaged can do the
>
> It can work remotely, but it's slower.
>
> > same. Is there evidence that suggests that a per-node kcompactd
> > is significantly better than a single kthread? I think others
> > would be more receptive of a single kthread addition.
>
> I think it's simpler design wrt waking up the kthread for the desired node,
> and self-tuning any sleeping depending on per-node pressure. It also matches
> the design of kswapd. And IMHO machines with many memory nodes should
> naturally have also many CPU's to cope with the threads, so it should all
> scale well.
>

I see your "proactive background compaction" as my "periodic compaction"
:) And I agree with your comment that we should be careful about defining
the API so it can be easily extended in the future.

I see the two mechanisms different enough that they need to be defined
separately: periodic compaction that would be done at certain intervals
regardless of fragmentation or allocation failures to keep fragmentation
low, and background compaction that would be done when a zone reaches a
certain fragmentation index for high orders, similar to extfrag_threshold,
or an allocation failure.

Per-node kcompactd threads we agree would be optimal, so let's try to see
if we can make that work.

What do you think about the following?

- add vm.compact_period_secs to define the number of seconds between
full compactions on each node. This compaction would reset the
pageblock skip heuristic and be synchronous. It would default to 900
based only on our evidence that 15m period compaction helps increase
our cpu utilization for khugepaged; it is arbitrary and I'd happily
change it if someone has a better suggestion. Changing it to 0 would
disable periodic compaction (we don't anticipate anybody will ever
want kcompactd threads to take 100% of cpu on each node). We can
stagger this over all nodes to avoid all kcompactd threads working at
the same time.

- add vm.compact_background_extfrag_threshold to define the extfrag
threshold when kcompactd should start doing sync_light migration
in the background without resetting the pageblock skip heuristic.
The threshold is defined at PAGE_ALLOC_COSTLY_ORDER and is halved
for each order higher so that very high order allocations don't
trigger it. To reduce overhead, this can be checked only in the
slowpath.
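
For concreteness, the halving rule could look like this userspace sketch
(`background_extfrag_threshold()` and the sysctl are hypothetical names from
the proposal, not an existing kernel interface):

```c
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: the sysctl value applies at orders up to
 * PAGE_ALLOC_COSTLY_ORDER and is halved for each order above it, so
 * very high order allocations need a proportionally lower
 * fragmentation index before kcompactd is woken.
 */
static int background_extfrag_threshold(int sysctl_threshold, int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return sysctl_threshold;
	return sysctl_threshold >> (order - PAGE_ALLOC_COSTLY_ORDER);
}
```

With the default extfrag_threshold of 500, order 4 would use 250 and
order 9 only 7, so huge-page-sized requests alone would rarely trigger it.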

I'd also like to talk about compacting of mlocked memory and limit it to
only periodic compaction so that we aren't constantly incurring minor
faults when not expected.

How does this sound?

2015-07-29 06:34:14

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On 07/29/2015 02:33 AM, David Rientjes wrote:
> On Fri, 24 Jul 2015, Vlastimil Babka wrote:
>
>> > Two issues I want to bring up:
>> >
>> > (1) do non-thp configs benefit from periodic compaction?
>> >
>> > In my experience, no, but perhaps there are other use cases where
>> > this has been a pain. The primary candidates, in my opinion,
>> > would be the networking stack and slub. Joonsoo reports having to
>> > workaround issues with high-order slub allocations being too
>> > expensive. I'm not sure that would be better served by periodic
>> > compaction, but it seems like a candidate for background compaction.
>>
>> Yes hopefully a proactive background compaction would serve them enough.
>>
>> > This is why my rfc tied periodic compaction to khugepaged, and we
>> > have strong evidence that this helps thp and cpu utilization. For
>> > periodic compaction to be possible outside of thp, we'd need a use
>> > case for it.
>> >
>> > (2) does kcompactd have to be per-node?
>> >
>> > I don't see the immediate benefit since direct compaction can
>> > already scan remote memory and migrate it, khugepaged can do the
>>
>> It can work remotely, but it's slower.
>>
>> > same. Is there evidence that suggests that a per-node kcompactd
>> > is significantly better than a single kthread? I think others
>> > would be more receptive of a single kthread addition.
>>
>> I think it's simpler design wrt waking up the kthread for the desired node,
>> and self-tuning any sleeping depending on per-node pressure. It also matches
>> the design of kswapd. And IMHO machines with many memory nodes should
>> naturally have also many CPU's to cope with the threads, so it should all
>> scale well.
>>
>
> I see your "proactive background compaction" as my "periodic compaction"
> :) And I agree with your comment that we should be careful about defining
> the API so it can be easily extended in the future.
>
> I see the two mechanisms different enough that they need to be defined
> separately: periodic compaction that would be done at certain intervals
> regardless of fragmentation or allocation failures to keep fragmentation
> low, and background compaction that would be done when a zone reaches a
> certain fragmentation index for high orders, similar to extfrag_threshold,
> or an allocation failure.

Is there a smart way to check the fragmentation index without doing it just
periodically, and without polluting the allocator fast paths?

Do you think we should still handle THP availability separately as this patchset
does, or not? I think it could still serve to reduce page fault latencies and
pointless khugepaged scanning when hugepages cannot be allocated.
Which implies, can the following be built on top of this patchset?

> Per-node kcompactd threads we agree would be optimal, so let's try to see
> if we can make that work.
>
> What do you think about the following?
>
> - add vm.compact_period_secs to define the number of seconds between
> full compactions on each node. This compaction would reset the
> pageblock skip heuristic and be synchronous. It would default to 900
> based only on our evidence that 15m period compaction helps increase
> our cpu utilization for khugepaged; it is arbitrary and I'd happily
> change it if someone has a better suggestion. Changing it to 0 would
> disable periodic compaction (we don't anticipate anybody will ever
> want kcompactd threads to take 100% of cpu on each node). We can
> stagger this over all nodes to avoid all kcompactd threads working at
> the same time.

I guess more testing would be useful to see that it still improves things over
the background compaction?

> - add vm.compact_background_extfrag_threshold to define the extfrag
> threshold when kcompactd should start doing sync_light migration
> in the background without resetting the pageblock skip heuristic.
> The threshold is defined at PAGE_ALLOC_COSTLY_ORDER and is halved
> for each order higher so that very high order allocations don't

I've pondered what exactly the fragmentation index calculates, and it's hard to
imagine how I'd set the threshold. Note that the equation already does
effectively a halving with each order increase, but probably in the opposite
direction that you want it to.
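
For reference, a userspace model of what `__fragmentation_index()` in
mm/vmstat.c computes (simplified: plain C division instead of div_u64):

```c
/*
 * Model of __fragmentation_index(): values toward 1000 mean an
 * allocation failure would be due to fragmentation, values toward 0
 * mean lack of free memory; -1000 means a suitable block exists, so
 * the index is meaningless (the allocation would not fail).
 */
static int frag_index(unsigned int order, unsigned long free_pages,
		      unsigned long free_blocks_total,
		      unsigned long free_blocks_suitable)
{
	unsigned long requested = 1UL << order;

	if (!free_blocks_total)
		return 0;
	if (free_blocks_suitable)
		return -1000;
	return 1000 - (int)((1000 + free_pages * 1000 / requested) /
			    free_blocks_total);
}
```

For example, 64 free pages scattered across 16 blocks with no order-3
block free gives an index of 438, which is below the default
extfrag_threshold of 500, i.e. the failure is attributed to lack of
memory and compaction is skipped.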

Michal Hocko suggested to me offline that we have tunables like
compact_min_order and compact_max_order, where (IIUC) compaction would trigger
when no pages of >=compact_min_order are available, and then compaction would
stop when pages of >=compact_max_order are available (i.e. a kind of
hysteresis). I'm not sure about this either, as the user would have to know
which order-allocations his particular drivers need (unless it's somehow
self-tuning).
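
A rough model of that hysteresis, if it helps the discussion
(`compact_min_order`/`compact_max_order` are only the suggested names,
and `highest_free_order` stands in for a free-list query):

```c
#include <stdbool.h>

/*
 * Hypothetical hysteresis: start compacting when no free page of at
 * least min_order exists, and keep going until a free page of at
 * least max_order appears, to avoid ping-ponging at one boundary.
 */
struct kcompactd_state {
	unsigned int min_order;		/* compact_min_order */
	unsigned int max_order;		/* compact_max_order */
	bool compacting;
};

static void kcompactd_update(struct kcompactd_state *st,
			     int highest_free_order)
{
	if (!st->compacting && highest_free_order < (int)st->min_order)
		st->compacting = true;
	else if (st->compacting && highest_free_order >= (int)st->max_order)
		st->compacting = false;
}
```

With min_order=2 and max_order=5, compaction would start once the best
free page drops below order 2 and not stop until an order-5 page is
available again.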

What I have instead in mind is something like the current high-order watermark
checking (which may be going away soon, but anyway...) basically for each order
we say how many pages of "at least that order" should be available. This could
be calculated progressively for all orders from a single tunable and size of
zone. Or maybe two tunables meant as min/max, to trigger the start and end of
background compaction.
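
The progressive calculation could be modelled like this (all names are
hypothetical; the divisor stands in for the single tunable, scaling the
target to the zone size):

```c
/*
 * Hypothetical per-order target: one tunable (a divisor of the zone's
 * managed pages) defines how much memory should be kept free at each
 * order. The block count falls off naturally with order, since an
 * order-o block covers 2^o pages.
 */
static unsigned long order_target_blocks(unsigned long managed_pages,
					 unsigned int divisor,
					 unsigned int order)
{
	unsigned long base_pages = managed_pages / divisor;
	unsigned long blocks = base_pages >> order;

	return blocks ? blocks : 1;	/* always want at least one */
}
```

E.g. with 262144 managed pages (1GB of 4KB pages) and a divisor of 100,
the zone would aim for 2621 free order-0 pages but only 5 free blocks of
order 9 or higher; a second such tunable could define where background
compaction stops.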

> trigger it. To reduce overhead, this can be checked only in the
> slowpath.

Hmm, the slowpath might be too late, but it could be a usable starting point.

> I'd also like to talk about compacting of mlocked memory and limit it to
> only periodic compaction so that we aren't constantly incurring minor
> faults when not expected.

Well, periodic compaction can be "expected" in the sense that period is known,
but how would the knowledge help the applications suffering from the minor faults?

> How does this sound?
>

2015-07-29 21:54:23

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Wed, 29 Jul 2015, Vlastimil Babka wrote:

> > I see the two mechanisms different enough that they need to be defined
> > separately: periodic compaction that would be done at certain intervals
> > regardless of fragmentation or allocation failures to keep fragmentation
> > low, and background compaction that would be done when a zone reaches a
> > certain fragmentation index for high orders, similar to extfrag_threshold,
> > or an allocation failure.
>
> Is there a smart way to check the fragmentation index without doing it just
> periodically, and without polluting the allocator fast paths?
>

I struggled with that one and that led to my suggestion about checking the
need for background compaction in the slowpath for high-order allocations,
much like kicking kswapd.

We certainly don't want to add fastpath overhead for this in the page
allocator nor in any compound page constructor.

The downside to doing it only in the slowpath, of course, is that the
allocation has to initially fail. I don't see that as being problematic,
though, because there's a good chance that the initial MIGRATE_ASYNC
direct compaction will be successful: I think we can easily check the
fragmentation index here and then kick off background compaction if
needed.

We can try to be clever not only about triggering background compaction at
certain thresholds, but also how much compaction we want kcompactd to do
in the background given the threshold. I wouldn't try to fine-tune those
heuristics in an initial implementation, though.
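
The slowpath check described above could be modelled like this
(`wakeup_kcompactd()` is a hypothetical hook, analogous to
wakeup_kswapd(); nothing like it exists in the patch yet):

```c
#include <stdbool.h>

static bool kcompactd_woken;
static unsigned int kcompactd_order;

/* Hypothetical hook, analogous to wakeup_kswapd(). */
static void wakeup_kcompactd(unsigned int order)
{
	kcompactd_woken = true;
	kcompactd_order = order;
}

/*
 * Model of the proposed check: it runs only in the allocator slowpath,
 * after the initial attempt has failed, so the fast path pays nothing.
 * kcompactd is kicked only when the fragmentation index says the
 * failure is due to fragmentation rather than lack of memory (index
 * above the threshold; -1000 means a suitable block exists).
 */
static void slowpath_maybe_kick_kcompactd(unsigned int order, int fragindex,
					  int threshold)
{
	if (!order)
		return;		/* order-0: nothing for compaction to do */
	if (fragindex > threshold)
		wakeup_kcompactd(order);
}
```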

> Do you think we should still handle THP availability separately as this patchset
> does, or not? I think it could still serve to reduce page fault latencies and
> pointless khugepaged scanning when hugepages cannot be allocated.
> Which implies, can the following be built on top of this patchset?
>

Seems like a different feature compared to implementing periodic and
background compaction. And since better compaction should yield better
allocation success for khugepaged, it would probably make sense to
evaluate the need for it after periodic and background compaction have
been tried; that would add more weight to the justification.

I can certainly implement periodic compaction similar to the rfc but using
per-node kcompactd threads and background compaction, and fold your
kcompactd introduction into the patchset as a first step. I was trying to
see if there were any concerns about the proposal first. I think it
covers Joonsoo's usecase.

> > What do you think about the following?
> >
> > - add vm.compact_period_secs to define the number of seconds between
> > full compactions on each node. This compaction would reset the
> > pageblock skip heuristic and be synchronous. It would default to 900
> > based only on our evidence that 15m period compaction helps increase
> > our cpu utilization for khugepaged; it is arbitrary and I'd happily
> > change it if someone has a better suggestion. Changing it to 0 would
> > disable periodic compaction (we don't anticipate anybody will ever
> > want kcompactd threads to take 100% of cpu on each node). We can
> > stagger this over all nodes to avoid all kcompactd threads working at
> > the same time.
>
> I guess more testing would be useful to see that it still improves things over
> the background compaction?
>
> > - add vm.compact_background_extfrag_threshold to define the extfrag
> > threshold when kcompactd should start doing sync_light migration
> > in the background without resetting the pageblock skip heuristic.
> > The threshold is defined at PAGE_ALLOC_COSTLY_ORDER and is halved
> > for each order higher so that very high order allocations don't
>
> I've pondered what exactly the fragmentation index calculates, and it's hard to
> imagine how I'd set the threshold. Note that the equation already does
> effectively a halving with each order increase, but probably in the opposite
> direction that you want it to.
>

I think we'd want to start with the default extfrag_threshold to determine
whether compacting in the background would be worthwhile.

> What I have instead in mind is something like the current high-order watermark
> checking (which may be going away soon, but anyway...) basically for each order
> we say how many pages of "at least that order" should be available. This could
> be calculated progressively for all orders from a single tunable and size of
> zone. Or maybe two tunables meant as min/max, to trigger the start and end of
> background compaction.
>

We will want to kick off background compaction in the slowpath immediately
just like kswapd for the given order. That background compaction should
continue until the fragmentation index meets
vm.compact_background_extfrag_threshold for that order, defaulting to
the global extfrag_threshold. Logically, that would make sense since
otherwise compaction would be skipped for that zone anyway. But it is
inverted as you mentioned.

> > I'd also like to talk about compacting of mlocked memory and limit it to
> > only periodic compaction so that we aren't constantly incurring minor
> > faults when not expected.
>
> Well, periodic compaction can be "expected" in the sense that period is known,
> but how would the knowledge help the applications suffering from the minor faults?
>

There's been a lot of debate over the years of whether compaction should
be able to migrate mlocked memory. We have done it for years and recently
upstream has moved in the same direction. Since direct compaction is able
to do it, I don't anticipate a problem with periodic compaction doing so,
especially at our setting of 15m. If that were substantially lower,
however, I could imagine it would have an effect due to increased minor
faults. That means we should either limit periodic compaction to a sane
minimum period, or it will become more complex and we must re-add unevictable vs
evictable behavior back into the migration scanner. Or, the best option,
we say if you really want to periodically compact so frequently that you
accept the tradeoffs and leave it to the admin :)

2015-07-29 23:57:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Wed, Jul 29, 2015 at 08:34:06AM +0200, Vlastimil Babka wrote:
> On 07/29/2015 02:33 AM, David Rientjes wrote:
> > On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> >
> >> > Two issues I want to bring up:
> >> >
> >> > (1) do non-thp configs benefit from periodic compaction?
> >> >
> >> > In my experience, no, but perhaps there are other use cases where
> >> > this has been a pain. The primary candidates, in my opinion,
> >> > would be the networking stack and slub. Joonsoo reports having to
> >> > workaround issues with high-order slub allocations being too
> >> > expensive. I'm not sure that would be better served by periodic
> >> > compaction, but it seems like a candidate for background compaction.
> >>
> >> Yes hopefully a proactive background compaction would serve them enough.
> >>
> >> > This is why my rfc tied periodic compaction to khugepaged, and we
> >> > have strong evidence that this helps thp and cpu utilization. For
> >> > periodic compaction to be possible outside of thp, we'd need a use
> >> > case for it.

Allowing us to use higher order pages in the page cache to support
filesystem block sizes larger than page size without having to
care about memory fragmentation preventing page cache allocation?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-30 10:58:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, Jul 02, 2015 at 10:46:32AM +0200, Vlastimil Babka wrote:
> Memory compaction can be currently performed in several contexts:
>
> - kswapd balancing a zone after a high-order allocation failure

Which potentially was a problem that's hard to detect.

> - direct compaction to satisfy a high-order allocation, including THP page
> fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
>
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate.
>

The latter would be very difficult to measure. It would have to be shown
that the compaction took movable pages from a pageblock assigned to
unmovable or reclaimable pages and that the action prevented a pageblock
being stolen. You'd have to track all allocation/frees and compaction
events and run it through a simulator. Even then, it'd prove/disprove it
in a single case.

The "obvious purpose" is sufficient justification IMO.

> The current situation wrt the purposes has a few drawbacks:
>
> - compaction is invoked only when a high-order page or hugepage is not
> available (or manually). This might be too late for the purposes of keeping
> memory fragmentation low.

Yep. The other side of the coin is that time can be spent compacting for
a non-existent user so there may be demand for tuning.

> - direct compaction increases latency of allocations. Again, it would be
> better if compaction was performed asynchronously to keep fragmentation low,
> before the allocation itself comes.

Definitely. Ideally direct compaction stalls would never occur unless the
caller absolutely requires it.

> - (a special case of the previous) the cost of compaction during THP page
> faults can easily offset the benefits of THP.
>
> To improve the situation, we need an equivalent of kswapd, but for compaction.
> E.g. a background thread which responds to fragmentation and the need for
> high-order allocations (including hugepages) somewhat proactively.
>
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
>

Agreed. Kswapd compacting can cause a direct reclaim stall for order-0. One
motivation for kswapd doing the compaction was for high-order atomic
allocation failures. At the risk of distracting from this series, the
requirement for high-order atomic allocations is better served by "Remove
zonelist cache and high-order watermark checking" than kswapd running
compaction. kcompactd would have a lot of value for both THP and allowing
high-atomic reserves to grow quickly if necessary.

> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
>
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations. The lifecycle mimics kswapd kthreads.
>
> The work loop of kcompactd currently mimics a pageblock-order direct
> compaction attempt every 15 seconds. This might not be enough to keep
> fragmentation low, and needs evaluation.
>

You could choose to adapt the rate based on the number of high-order
requests that entered the slow path. Initially I would not try though,
keep it simple first.
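
Adapting the rate could later be sketched along these lines (purely
illustrative; nothing like this is in the patch):

```c
/*
 * Sketch: halve the sleep when high-order slowpath entries were seen
 * since the last run, otherwise back off toward the maximum, clamped
 * to a sane range.
 */
static unsigned int kcompactd_next_sleep_ms(unsigned int cur_ms,
					    unsigned long slowpath_events)
{
	const unsigned int min_ms = 1000, max_ms = 15000;

	cur_ms = slowpath_events ? cur_ms / 2 : cur_ms * 2;
	if (cur_ms < min_ms)
		cur_ms = min_ms;
	if (cur_ms > max_ms)
		cur_ms = max_ms;
	return cur_ms;
}
```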

> When there's not enough free memory for compaction, kswapd is woken up for
> reclaim only (not compaction/reclaim).
>
> Further patches will add the ability to wake up kcompactd on demand in special
> situations such as when hugepages are not available, or when a fragmentation
> event occurred.
>
> Not-yet-signed-off-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/compaction.h | 11 +++
> include/linux/mmzone.h | 4 ++
> mm/compaction.c | 173 +++++++++++++++++++++++++++++++++++++++++++++
> mm/memory_hotplug.c | 15 ++--
> mm/page_alloc.c | 3 +
> 5 files changed, 201 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index aa8f61c..a2525d8 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -51,6 +51,9 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> bool alloc_success);
> extern bool compaction_restarting(struct zone *zone, int order);
>
> +extern int kcompactd_run(int nid);
> +extern void kcompactd_stop(int nid);
> +
> #else
> static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
> unsigned int order, int alloc_flags,
> @@ -83,6 +86,14 @@ static inline bool compaction_deferred(struct zone *zone, int order)
> return true;
> }
>
> +static inline int kcompactd_run(int nid)
> +{
> + return 0;
> +}
> +static inline void kcompactd_stop(int nid)
> +{
> +}
> +
> #endif /* CONFIG_COMPACTION */
>
> #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 54d74f6..bc96a23 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -762,6 +762,10 @@ typedef struct pglist_data {
> /* Number of pages migrated during the rate limiting time interval */
> unsigned long numabalancing_migrate_nr_pages;
> #endif
> +#ifdef CONFIG_COMPACTION
> + struct task_struct *kcompactd;
> + wait_queue_head_t kcompactd_wait;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 018f08d..fcbc093 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,8 @@
> #include <linux/balloon_compaction.h>
> #include <linux/page-isolation.h>
> #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> #include "internal.h"
>
> #ifdef CONFIG_COMPACTION
> @@ -29,6 +31,10 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
> {
> count_vm_events(item, delta);
> }
> +
> +//TODO: add tuning knob
> +static unsigned int kcompactd_sleep_millisecs __read_mostly = 15000;
> +
> #else
> #define count_compact_event(item) do { } while (0)
> #define count_compact_events(item, delta) do { } while (0)

Leave the tuning knob out unless it is absolutely required. kcompactd
may eventually decide that a time-based heuristic for wakeups is not
enough. Minimally add a mechanism that wakes kcompactd up in response to
allocation failures to how wakeup_kswapd gets called. An alternative
would be to continue waking kswapd as normal, have kswapd only reclaim
order-0 pages as part of the reclaim/compaction phase and wake kcompactd
when the reclaim phase is complete. If kswapd decides that no reclaim is
necessary then kcompactd gets woken immediately.

khugepaged would then kick kswapd through the normal mechanism and
potentially avoid direct compaction. On NUMA machines, it could keep
scanning to see if there is another node whose pages can be collapsed. On
UMA, it could just pause immediately and wait for kswapd and kcompactd to
do something useful.

There will be different opinions on periodic compaction but to be honest,
periodic compaction also could be implemented from userspace using the
compact_node sysfs files. The risk with periodic compaction is that it
can cause stalls in applications that do not care if they fault the pages
being migrated. This may happen even though there are zero requirements
for high-order pages from anybody.

> @@ -1714,4 +1720,171 @@ void compaction_unregister_node(struct node *node)
> }
> #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>
> +/*
> + * Has any special work been requested of kcompactd?
> + */
> +static bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> + return false;
> +}
> +
> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> + /*
> + * //TODO: smarter decisions on how much to compact. Using pageblock
> + * order might result in no compaction, until fragmentation builds up
> + * too much. Using order -1 could be too aggressive on large zones.
> + *

You could consider using pgdat->kswapd_max_order? That thing is meant
to help kswapd decide what order is required by callers at the moment.
Again, kswapd does the order-0 reclaim and then wakes kcompactd with the
kswapd_max_order as a parameter.

> + * With no special task, compact all zones so that a pageblock-order
> + * page is allocatable. Wake up kswapd if there's not enough free
> + * memory for compaction.
> + */
> + int zoneid;
> + struct zone *zone;
> + struct compact_control cc = {
> + .order = pageblock_order,
> + .mode = MIGRATE_SYNC,
> + .ignore_skip_hint = true,
> + };
> +
> + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +
> + int suitable;
> +
> + zone = &pgdat->node_zones[zoneid];
> + if (!populated_zone(zone))
> + continue;
> +
> + suitable = compaction_suitable(zone, cc.order, 0, 0);
> +
> + if (suitable == COMPACT_SKIPPED) {
> + /*
> + * We pass order==0 to kswapd so it doesn't compact by
> + * itself. We just need enough free pages to proceed
> + * with compaction here on next kcompactd wakeup.
> + */
> + wakeup_kswapd(zone, 0, 0);
> + continue;
> + }

I think it makes more sense that kswapd kicks kcompactd than the other
way around.

Overall, I like the idea.

--
Mel Gorman
SUSE Labs

2015-07-31 21:17:25

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC 1/4] mm, compaction: introduce kcompactd

On Thu, 30 Jul 2015, Mel Gorman wrote:

> There will be different opinions on periodic compaction but to be honest,
> periodic compaction also could be implemented from userspace using the
> compact_node sysfs files. The risk with periodic compaction is that it
> can cause stalls in applications that do not care if they fault the pages
> being migrated. This may happen even though there are zero requirements
> for high-order pages from anybody.
>

When thp is enabled, I think there is always a non-zero requirement for
high-order pages. That's why we've shown an increase of 1.4% in cpu
utilization over all our machines by doing periodic memory compaction.
It's essential when thp is enabled and no amount of background compaction
kicked off with a trigger similar to kswapd (which I have agreed with in
this thread) is going to assist when a very large process is exec'd.

That's why my proposal was for background compaction through kcompactd
kicked off in the allocator slowpath and for periodic compaction on, at
the minimum, thp configurations to keep fragmentation low. Dave Chinner
seems to also have a usecase absent thp for high-order page cache
allocation.

I think it would depend on how aggressive you are proposing background
compaction to be: whether it will ever be MIGRATE_SYNC over all memory, or
whether it will only terminate when a fragmentation index meets a
threshold.