LinuxLists.cc - Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

2016-03-02 06:33:08

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On Mon, Feb 08, 2016 at 02:38:10PM +0100, Vlastimil Babka wrote:
> Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
> compaction to attempt making memory allocation of given order available. The
> details differ from direct reclaim e.g. in having high watermark as a goal.
> The code involved in kswapd's reclaim/compaction decisions has evolved to be
> quite complex. Testing reveals that it doesn't actually work in at least one
> scenario, and closer inspection suggests that it could be greatly simplified
> without compromising on the goal (make high-order page available) or efficiency
> (don't reclaim too much). The simplification relies on doing all compaction in
> kcompactd, which is simply woken up when high watermarks are reached by
> kswapd's reclaim.
>
> The scenario where kswapd compaction doesn't work was found with mmtests test
> stress-highalloc configured to attempt order-9 allocations without direct
> reclaim, just waking up kswapd. There was no compaction attempt from kswapd
> during the whole test. Some added instrumentation shows what happens:
>
> - balance_pgdat() sets end_zone to Normal, as it's not balanced
> - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
> cannot reclaim anything, so sc.nr_reclaimed is 0
> - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
> merely checks if high watermarks were reached for base pages. This is true,
> so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
> compaction_suitable() returned COMPACT_SKIPPED
> - even though the pgdat_needs_compaction flag wasn't set to false, no
> compaction happens due to the condition sc.nr_reclaimed > nr_attempted
> being false (as 0 < 99)
> - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
> pgdat_balanced() is false as only the small zone DMA appears balanced
> (curiously in that check, watermark appears OK and compaction_suitable()
> returns COMPACT_PARTIAL, because a lower classzone_idx is used there)
>
> Now, even if it was decided that reclaim shouldn't be attempted on the DMA
> zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
> is also false. The condition really should use >= as the comment suggests.
> Then there is a mismatch in the check for setting pgdat_needs_compaction to
> false using low watermark, while the rest uses high watermark, and who knows
> what other subtlety. Hopefully this demonstrates that this is unsustainable.
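
A minimal sketch of the gate described above, for checking the two cases by hand. Only the condition itself comes from the changelog; the helper names old_gate(), ge_gate() and check_changelog_cases() are hypothetical, and ge_gate() merely shows what the ">=" form would do. This is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as quoted in the changelog: compact only if strictly more
 * pages were reclaimed than were attempted. */
static bool old_gate(bool pgdat_needs_compaction,
		     unsigned long nr_reclaimed,
		     unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}

/* Hypothetical variant using ">=", as the changelog says the comment
 * suggests. */
static bool ge_gate(bool pgdat_needs_compaction,
		    unsigned long nr_reclaimed,
		    unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed >= nr_attempted;
}

/* The two cases walked through in the changelog. */
static void check_changelog_cases(void)
{
	/* DMA zone attempted 99 pages, nothing reclaimed: never compacts. */
	assert(!old_gate(true, 0, 99));
	assert(!ge_gate(true, 0, 99));

	/* No reclaim attempted at all: ">" still refuses, ">=" would pass. */
	assert(!old_gate(true, 0, 0));
	assert(ge_gate(true, 0, 0));
}
```

With ">" the gate stays closed in both cases, which matches the observation that kswapd never compacted during the test.
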
>
> Luckily we can simplify this a lot. The reclaim/compaction decisions make
> sense for direct reclaim scenario, but in kswapd, our primary goal is to reach
> high watermark in order-0 pages. Afterwards we can attempt compaction just
> once. Unlike direct reclaim, we don't reclaim extra pages (over the high
> watermark), the current code already disallows it for good reasons.
>
> After this patch, we simply wake up kcompactd to process the pgdat, after we
> have either succeeded or failed to reach the high watermarks in kswapd, which
> goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
> the same criteria to determine which zones are worth compacting. Note that we
> use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
> can include higher zones that kswapd tried to balance too, but didn't consider
> them in pgdat_balanced().
>
> Since kswapd now cannot create high-order pages itself, we need to adjust how
> it determines the zones to be balanced. The key element here is adding a
> "highorder" parameter to zone_balanced, which, when set to false, makes it
> consider only order-0 watermark instead of the desired higher order (this was
> done previously by kswapd_shrink_zone(), but not elsewhere). This false is
> passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
> to make sure kswapd and thus kcompactd are woken up for a high-order allocation
> failure.
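
The effect of the "highorder" parameter can be sketched with a toy model. Everything here is a simplification under stated assumptions: struct toy_zone, its fields and toy_watermark_ok() are hypothetical stand-ins (the real watermark check also accounts for lowmem reserves and per-order free lists), and CONFIG_COMPACTION is assumed to be enabled. Only the shape of toy_zone_balanced() follows the patched zone_balanced().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for struct zone; all fields are hypothetical. */
struct toy_zone {
	unsigned long free_pages;	/* total free base pages */
	unsigned long high_wmark;	/* high watermark, in base pages */
	unsigned int max_free_order;	/* largest order with a free block */
};

/* Simplified watermark check: enough free pages overall and, for
 * order > 0, at least one free block of that order or larger. */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned int order,
			     unsigned long mark)
{
	if (z->free_pages <= mark)
		return false;
	return order == 0 || z->max_free_order >= order;
}

/* Same shape as the patched zone_balanced(), with compaction assumed
 * to be configured in. */
static bool toy_zone_balanced(const struct toy_zone *z, unsigned int order,
			      bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	assert(order < sizeof(unsigned long) * CHAR_BIT);
	if (!highorder) {
		/* Ask for the extra pages, but only as order-0 pages. */
		mark += 1UL << order;
		order = 0;
	}
	return toy_watermark_ok(z, order, mark);
}
```

For a fragmented zone with 2000 free pages, a high watermark of 1000 and no free block above order 3, an order-9 request is "balanced" with highorder=false (kswapd can stop and leave the rest to kcompactd) but not with highorder=true (so a wakeup still goes through).
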
>
> For testing, I used stress-highalloc configured to do order-9 allocations with
> GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
> reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
> usual):
>
> stress-highalloc
>                          4.5-rc1               4.5-rc1
>                           3-test                4-test
> Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
> Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
> Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
> Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
> Success 3 Min         34.00 (  0.00%)       63.00 ( -85.29%)
> Success 3 Mean        41.80 (  0.00%)       64.60 ( -54.55%)
> Success 3 Max         53.00 (  0.00%)       67.00 ( -26.42%)
>
>              4.5-rc1     4.5-rc1
>               3-test      4-test
> User         3166.67     3088.82
> System       1153.37     1142.01
> Elapsed      1768.53     1780.91
>
> 4.5-rc1 4.5-rc1
> 3-test 4-test
> Minor Faults 106940795 106582816
> Major Faults 829 813
> Swap Ins 482 311
> Swap Outs 6278 5598
> Allocation stalls 128 184
> DMA allocs 145 32
> DMA32 allocs 74646161 74843238
> Normal allocs 26090955 25886668
> Movable allocs 0 0
> Direct pages scanned 32938 31429
> Kswapd pages scanned 2183166 2185293
> Kswapd pages reclaimed 2152359 2134389
> Direct pages reclaimed 32735 31234
> Kswapd efficiency 98% 97%
> Kswapd velocity 1243.877 1228.666
> Direct efficiency 99% 99%
> Direct velocity 18.767 17.671
> Percentage direct scans 1% 1%
> Zone normal velocity 299.981 291.409
> Zone dma32 velocity 962.522 954.928
> Zone dma velocity 0.142 0.000
> Page writes by reclaim 6278.800 5598.600
> Page writes file 0 0
> Page writes anon 6278 5598
> Page reclaim immediate 93 96
> Sector Reads 4357114 4307161
> Sector Writes 11053628 11053091
> Page rescued immediate 0 0
> Slabs scanned 1592829 1555770
> Direct inode steals 1557 2025
> Kswapd inode steals 46056 45418
> Kswapd skipped wait 0 0
> THP fault alloc 579 614
> THP collapse alloc 304 324
> THP splits 0 0
> THP fault fallback 793 730
> THP collapse fail 11 14
> Compaction stalls 1013 959
> Compaction success 92 69
> Compaction failures 920 890
> Page migrate success 238457 662054
> Page migrate failure 23021 32846
> Compaction pages isolated 504695 1370326
> Compaction migrate scanned 661390 7025772
> Compaction free scanned 13476658 73302642
> Compaction cost 262 762
>
> After this patch we see improvements in allocation success rate (especially for
> phase 3) along with increased compaction activity. The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
> bit.
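
When reading the success tables, note that the percentage column appears to be (baseline - patched) / baseline, so a negative value means the patched kernel scored higher. The formula is inferred from the printed figures, not stated in the mail, and delta_percent() is a hypothetical helper used only to reproduce them.

```c
#include <assert.h>

/* Relative change as it appears in the tables:
 * (baseline - patched) / baseline, in percent.
 * Negative means the patched kernel scored higher. */
static double delta_percent(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

For example, delta_percent(1.40, 4.00) is about -185.71, matching the "Success 1 Mean" row, and delta_percent(13.00, 12.00) is about 7.69, matching the one row in the second test where the patched kernel scored lower.
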

Why did you run the test with THP enabled? THP interferes with the result of
the main test, so it would be better not to enable it.

Also, the increased compaction activity with this patch (10 times for migrate
scanned) may be due to resetting the skip block information. Isn't it better to
disable that in this patch, so that it works as similarly as possible to what
kswapd does, and re-enable it in the next patch? If something goes bad, it can
simply be reverted.

It looks like it is not even mentioned in the description.

>
> We can also configure stress-highalloc to perform both direct
> reclaim/compaction and wakeup kswapd/kcompactd, by using
> GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
>
> stress-highalloc
>                          4.5-rc1               4.5-rc1
>                          3-test2               4-test2
> Success 1 Min          4.00 (  0.00%)        6.00 ( -50.00%)
> Success 1 Mean         8.00 (  0.00%)        8.40 (  -5.00%)
> Success 1 Max         12.00 (  0.00%)       13.00 (  -8.33%)
> Success 2 Min          4.00 (  0.00%)        6.00 ( -50.00%)
> Success 2 Mean         8.20 (  0.00%)        8.60 (  -4.88%)
> Success 2 Max         13.00 (  0.00%)       12.00 (   7.69%)
> Success 3 Min         75.00 (  0.00%)       75.00 (   0.00%)
> Success 3 Mean        75.60 (  0.00%)       75.60 (   0.00%)
> Success 3 Max         77.00 (  0.00%)       76.00 (   1.30%)
>
>              4.5-rc1     4.5-rc1
>              3-test2     4-test2
> User         3344.73     3258.62
> System       1194.24     1177.92
> Elapsed      1838.04     1837.02
>
> 4.5-rc1 4.5-rc1
> 3-test2 4-test2
> Minor Faults 111269736 109392253
> Major Faults 806 755
> Swap Ins 671 155
> Swap Outs 5390 5790
> Allocation stalls 4610 4562
> DMA allocs 250 34
> DMA32 allocs 78091501 76901680
> Normal allocs 27004414 26587089
> Movable allocs 0 0
> Direct pages scanned 125146 108854
> Kswapd pages scanned 2119757 2131589
> Kswapd pages reclaimed 2073183 2090937
> Direct pages reclaimed 124909 108699
> Kswapd efficiency 97% 98%
> Kswapd velocity 1161.027 1160.870
> Direct efficiency 99% 99%
> Direct velocity 68.545 59.283
> Percentage direct scans 5% 4%
> Zone normal velocity 296.678 294.389
> Zone dma32 velocity 932.841 925.764
> Zone dma velocity 0.053 0.000
> Page writes by reclaim 5392.000 5790.600
> Page writes file 1 0
> Page writes anon 5390 5790
> Page reclaim immediate 104 218
> Sector Reads 4350232 4376989
> Sector Writes 11126496 11102113
> Page rescued immediate 0 0
> Slabs scanned 1705294 1692486
> Direct inode steals 8700 16266
> Kswapd inode steals 36352 28364
> Kswapd skipped wait 0 0
> THP fault alloc 599 567
> THP collapse alloc 323 326
> THP splits 0 0
> THP fault fallback 806 805
> THP collapse fail 17 18
> Compaction stalls 2457 2070
> Compaction success 906 527
> Compaction failures 1551 1543
> Page migrate success 2031423 2423657
> Page migrate failure 32845 28790
> Compaction pages isolated 4129761 4916017
> Compaction migrate scanned 11996712 19370264
> Compaction free scanned 214970969 360662356
> Compaction cost 2271 2745
>
> Here, this patch doesn't change the success rate as direct compaction already
> tries what it can. There's however significant reduction in direct compaction
> stalls, made entirely of the successful stalls. This means the offload to
> kcompactd is working as expected, and direct compaction is reduced either due
> to detecting contention, or compaction deferred by kcompactd. In the previous
> version of this patchset there was some apparent reduction of success rate,
> but the changes in this version (such as using sync compaction only), new
> baseline kernel, and/or averaging results from 5 executions (my bet), made this
> go away.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/vmscan.c | 146 ++++++++++++++++++++----------------------------------------
> 1 file changed, 48 insertions(+), 98 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c67df4831565..b8478a737ef5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
> } while (memcg);
> }
>
> -static bool zone_balanced(struct zone *zone, int order,
> - unsigned long balance_gap, int classzone_idx)
> +static bool zone_balanced(struct zone *zone, int order, bool highorder,
> + unsigned long balance_gap, int classzone_idx)
> {
> - if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
> - balance_gap, classzone_idx))
> - return false;
> + unsigned long mark = high_wmark_pages(zone) + balance_gap;
>
> - if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
> - order, 0, classzone_idx) == COMPACT_SKIPPED)
> - return false;
> + /*
> + * When checking from pgdat_balanced(), kswapd should stop and sleep
> + * when it reaches the high order-0 watermark and let kcompactd take
> + * over. Other callers such as wakeup_kswapd() want to determine the
> + * true high-order watermark.
> + */
> + if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
> + mark += (1UL << order);
> + order = 0;
> + }
>
> - return true;
> + return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> }
>
> /*
> @@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> continue;
> }
>
> - if (zone_balanced(zone, order, 0, i))
> + if (zone_balanced(zone, order, false, 0, i))
> balanced_pages += zone->managed_pages;
> else if (!order)
> return false;
> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> int classzone_idx,
> - struct scan_control *sc,
> - unsigned long *nr_attempted)
> + struct scan_control *sc)
> {
> int testorder = sc->order;

You can remove testorder completely.

> unsigned long balance_gap;
> @@ -3077,17 +3081,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>
> /*
> - * Kswapd reclaims only single pages with compaction enabled. Trying
> - * too hard to reclaim until contiguous free pages have become
> - * available can hurt performance by evicting too much useful data
> - * from memory. Do not reclaim more than needed for compaction.
> - */
> - if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
> - compaction_suitable(zone, sc->order, 0, classzone_idx)
> - != COMPACT_SKIPPED)
> - testorder = 0;
> -
> - /*
> * We put equal pressure on every zone, unless one zone has way too
> * many pages free already. The "too many pages" is defined as the
> * high wmark plus a "gap" where the gap is either the low
> @@ -3101,15 +3094,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
> * reclaim is necessary
> */
> lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
> - if (!lowmem_pressure && zone_balanced(zone, testorder,
> + if (!lowmem_pressure && zone_balanced(zone, testorder, false,
> balance_gap, classzone_idx))
> return true;
>
> shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> - /* Account for the number of pages attempted to reclaim */
> - *nr_attempted += sc->nr_to_reclaim;
> -
> clear_bit(ZONE_WRITEBACK, &zone->flags);
>
> /*
> @@ -3119,7 +3109,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> * waits.
> */
> if (zone_reclaimable(zone) &&
> - zone_balanced(zone, testorder, 0, classzone_idx)) {
> + zone_balanced(zone, testorder, false, 0, classzone_idx)) {
> clear_bit(ZONE_CONGESTED, &zone->flags);
> clear_bit(ZONE_DIRTY, &zone->flags);
> }
> @@ -3131,7 +3121,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> *
> - * Returns the final order kswapd was reclaiming at
> + * Returns the highest zone idx kswapd was reclaiming at
> *
> * There is special handling here for zones which are full of pinned pages.
> * This can happen if the pages are all mlocked, or if they are all used by
> @@ -3148,8 +3138,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> * interoperates with the page allocator fallback scheme to ensure that aging
> * of pages is balanced across the zones.
> */
> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> - int *classzone_idx)
> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> {
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> count_vm_event(PAGEOUTRUN);
>
> do {
> - unsigned long nr_attempted = 0;
> bool raise_priority = true;
> - bool pgdat_needs_compaction = (order > 0);
>
> sc.nr_reclaimed = 0;
>
> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> break;
> }
>
> - if (!zone_balanced(zone, order, 0, 0)) {
> + if (!zone_balanced(zone, order, true, 0, 0)) {

Should we use highorder = true here? We eventually skip reclaim in
kswapd_shrink_zone() when zone_balanced(,,false,,) is true.

Thanks.

> end_zone = i;
> break;
> } else {
> @@ -3219,24 +3206,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> if (i < 0)
> goto out;
>
> - for (i = 0; i <= end_zone; i++) {
> - struct zone *zone = pgdat->node_zones + i;
> -
> - if (!populated_zone(zone))
> - continue;
> -
> - /*
> - * If any zone is currently balanced then kswapd will
> - * not call compaction as it is expected that the
> - * necessary pages are already available.
> - */
> - if (pgdat_needs_compaction &&
> - zone_watermark_ok(zone, order,
> - low_wmark_pages(zone),
> - *classzone_idx, 0))
> - pgdat_needs_compaction = false;
> - }
> -
> /*
> * If we're getting trouble reclaiming, start doing writepage
> * even in laptop mode.
> @@ -3280,8 +3249,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * that that high watermark would be met at 100%
> * efficiency.
> */
> - if (kswapd_shrink_zone(zone, end_zone,
> - &sc, &nr_attempted))
> + if (kswapd_shrink_zone(zone, end_zone, &sc))
> raise_priority = false;
> }
>
> @@ -3294,49 +3262,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> pfmemalloc_watermark_ok(pgdat))
> wake_up_all(&pgdat->pfmemalloc_wait);
>
> - /*
> - * Fragmentation may mean that the system cannot be rebalanced
> - * for high-order allocations in all zones. If twice the
> - * allocation size has been reclaimed and the zones are still
> - * not balanced then recheck the watermarks at order-0 to
> - * prevent kswapd reclaiming excessively. Assume that a
> - * process requested a high-order can direct reclaim/compact.
> - */
> - if (order && sc.nr_reclaimed >= 2UL << order)
> - order = sc.order = 0;
> -
> /* Check if kswapd should be suspending */
> if (try_to_freeze() || kthread_should_stop())
> break;
>
> /*
> - * Compact if necessary and kswapd is reclaiming at least the
> - * high watermark number of pages as requsted
> - */
> - if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
> - compact_pgdat(pgdat, order);
> -
> - /*
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> if (raise_priority || !sc.nr_reclaimed)
> sc.priority--;
> } while (sc.priority >= 1 &&
> - !pgdat_balanced(pgdat, order, *classzone_idx));
> + !pgdat_balanced(pgdat, order, classzone_idx));
>
> out:
> /*
> - * Return the order we were reclaiming at so prepare_kswapd_sleep()
> - * makes a decision on the order we were last reclaiming at. However,
> - * if another caller entered the allocator slow path while kswapd
> - * was awake, order will remain at the higher level
> + * Return the highest zone idx we were reclaiming at so
> + * prepare_kswapd_sleep() makes the same decisions as here.
> */
> - *classzone_idx = end_zone;
> - return order;
> + return end_zone;
> }
>
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> +static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> + int classzone_idx, int balanced_classzone_idx)
> {
> long remaining = 0;
> DEFINE_WAIT(wait);
> @@ -3347,7 +3295,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>
> /* Try to sleep for a short interval */
> - if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
> + if (prepare_kswapd_sleep(pgdat, order, remaining,
> + balanced_classzone_idx)) {
> remaining = schedule_timeout(HZ/10);
> finish_wait(&pgdat->kswapd_wait, &wait);
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> @@ -3357,7 +3306,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> * After a short sleep, check if it was a premature sleep. If not, then
> * go fully to sleep until explicitly woken up.
> */
> - if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
> + if (prepare_kswapd_sleep(pgdat, order, remaining,
> + balanced_classzone_idx)) {
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>
> /*
> @@ -3378,6 +3328,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> */
> reset_isolation_suitable(pgdat);
>
> + /*
> + * We have freed the memory, now we should compact it to make
> + * allocation of the requested order possible.
> + */
> + wakeup_kcompactd(pgdat, order, classzone_idx);
> +
> if (!kthread_should_stop())
> schedule();
>
> @@ -3407,7 +3363,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> static int kswapd(void *p)
> {
> unsigned long order, new_order;
> - unsigned balanced_order;
> int classzone_idx, new_classzone_idx;
> int balanced_classzone_idx;
> pg_data_t *pgdat = (pg_data_t*)p;
> @@ -3440,23 +3395,19 @@ static int kswapd(void *p)
> set_freezable();
>
> order = new_order = 0;
> - balanced_order = 0;
> classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> balanced_classzone_idx = classzone_idx;
> for ( ; ; ) {
> bool ret;
>
> /*
> - * If the last balance_pgdat was unsuccessful it's unlikely a
> - * new request of a similar or harder type will succeed soon
> - * so consider going to sleep on the basis we reclaimed at
> + * While we were reclaiming, there might have been another
> + * wakeup, so check the values.
> */
> - if (balanced_order == new_order) {
> - new_order = pgdat->kswapd_max_order;
> - new_classzone_idx = pgdat->classzone_idx;
> - pgdat->kswapd_max_order = 0;
> - pgdat->classzone_idx = pgdat->nr_zones - 1;
> - }
> + new_order = pgdat->kswapd_max_order;
> + new_classzone_idx = pgdat->classzone_idx;
> + pgdat->kswapd_max_order = 0;
> + pgdat->classzone_idx = pgdat->nr_zones - 1;
>
> if (order < new_order || classzone_idx > new_classzone_idx) {
> /*
> @@ -3466,7 +3417,7 @@ static int kswapd(void *p)
> order = new_order;
> classzone_idx = new_classzone_idx;
> } else {
> - kswapd_try_to_sleep(pgdat, balanced_order,
> + kswapd_try_to_sleep(pgdat, order, classzone_idx,
> balanced_classzone_idx);
> order = pgdat->kswapd_max_order;
> classzone_idx = pgdat->classzone_idx;
> @@ -3486,9 +3437,8 @@ static int kswapd(void *p)
> */
> if (!ret) {
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> - balanced_classzone_idx = classzone_idx;
> - balanced_order = balance_pgdat(pgdat, order,
> - &balanced_classzone_idx);
> + balanced_classzone_idx = balance_pgdat(pgdat, order,
> + classzone_idx);
> }
> }
>
> @@ -3518,7 +3468,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> }
> if (!waitqueue_active(&pgdat->kswapd_wait))
> return;
> - if (zone_balanced(zone, order, 0, 0))
> + if (zone_balanced(zone, order, true, 0, 0))
> return;
>
> trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> --
> 2.7.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-02 10:05:11

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>
>> 4.5-rc1 4.5-rc1
>> 3-test 4-test
>> Minor Faults 106940795 106582816
>> Major Faults 829 813
>> Swap Ins 482 311
>> Swap Outs 6278 5598
>> Allocation stalls 128 184
>> DMA allocs 145 32
>> DMA32 allocs 74646161 74843238
>> Normal allocs 26090955 25886668
>> Movable allocs 0 0
>> Direct pages scanned 32938 31429
>> Kswapd pages scanned 2183166 2185293
>> Kswapd pages reclaimed 2152359 2134389
>> Direct pages reclaimed 32735 31234
>> Kswapd efficiency 98% 97%
>> Kswapd velocity 1243.877 1228.666
>> Direct efficiency 99% 99%
>> Direct velocity 18.767 17.671
>> Percentage direct scans 1% 1%
>> Zone normal velocity 299.981 291.409
>> Zone dma32 velocity 962.522 954.928
>> Zone dma velocity 0.142 0.000
>> Page writes by reclaim 6278.800 5598.600
>> Page writes file 0 0
>> Page writes anon 6278 5598
>> Page reclaim immediate 93 96
>> Sector Reads 4357114 4307161
>> Sector Writes 11053628 11053091
>> Page rescued immediate 0 0
>> Slabs scanned 1592829 1555770
>> Direct inode steals 1557 2025
>> Kswapd inode steals 46056 45418
>> Kswapd skipped wait 0 0
>> THP fault alloc 579 614
>> THP collapse alloc 304 324
>> THP splits 0 0
>> THP fault fallback 793 730
>> THP collapse fail 11 14
>> Compaction stalls 1013 959
>> Compaction success 92 69
>> Compaction failures 920 890
>> Page migrate success 238457 662054
>> Page migrate failure 23021 32846
>> Compaction pages isolated 504695 1370326
>> Compaction migrate scanned 661390 7025772
>> Compaction free scanned 13476658 73302642
>> Compaction cost 262 762
>>
>> After this patch we see improvements in allocation success rate (especially for
>> phase 3) along with increased compaction activity. The compaction stalls
>> (direct compaction) in the interfering kernel builds (probably THP's) also
>> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
>> bit.
>
> Why you did the test with THP? THP interferes result of main test so
> it would be better not to enable it.

Hmm I've always left it enabled. It makes for a more realistic
interference and would also show unintended regressions in that closely
related area.

> And, this patch increased compaction activity (10 times for migrate scanned)
> may be due to resetting skip block information.

Note that kswapd compaction activity was completely non-existent for
reasons outlined in the changelog.

> Isn't is better to disable it
> for this patch to work as similar as possible that kswapd does and re-enable it
> on next patch? If something goes bad, it can simply be reverted.
>
> Look like it is even not mentioned in the description.

Yeah, skip block information is discussed in the next patch, which
mentions that it's being reset and why. I think it makes more sense, as
when kswapd reclaims from low watermark to high, potentially many
pageblocks have new free pages and the skip bits are obsolete. Next,
kcompactd is a separate thread, so it doesn't stall allocations (or kswapd
reclaim) by its activity.
Personally I hope that one day we can get rid of the skip bits
completely. They can make the stats look nicer, but I think
their effect is nearly random.

>> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>> */
>> static bool kswapd_shrink_zone(struct zone *zone,
>> int classzone_idx,
>> - struct scan_control *sc,
>> - unsigned long *nr_attempted)
>> + struct scan_control *sc)
>> {
>> int testorder = sc->order;
>
> You can remove testorder completely.

Hm right, thanks.

>> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> - int *classzone_idx)
>> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>> {
>> int i;
>> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
>> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> count_vm_event(PAGEOUTRUN);
>>
>> do {
>> - unsigned long nr_attempted = 0;
>> bool raise_priority = true;
>> - bool pgdat_needs_compaction = (order > 0);
>>
>> sc.nr_reclaimed = 0;
>>
>> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> break;
>> }
>>
>> - if (!zone_balanced(zone, order, 0, 0)) {
>> + if (!zone_balanced(zone, order, true, 0, 0)) {
>
> Should we use highorder = true? We eventually skip to reclaim in the
> kswapd_shrink_zone() when zone_balanced(,,false,,) is true.

Hmm right. I probably thought that the value of end_zone ->
balanced_classzone_idx would be important when waking kcompactd, but
it's not used, so it's causing just some wasted CPU cycles.

Thanks for the reviews!

2016-03-02 13:57:53

by Joonsoo Kim

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

2016-03-02 19:04 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>
>>>
>>> 4.5-rc1 4.5-rc1
>>> 3-test 4-test
>>> Minor Faults 106940795 106582816
>>> Major Faults 829 813
>>> Swap Ins 482 311
>>> Swap Outs 6278 5598
>>> Allocation stalls 128 184
>>> DMA allocs 145 32
>>> DMA32 allocs 74646161 74843238
>>> Normal allocs 26090955 25886668
>>> Movable allocs 0 0
>>> Direct pages scanned 32938 31429
>>> Kswapd pages scanned 2183166 2185293
>>> Kswapd pages reclaimed 2152359 2134389
>>> Direct pages reclaimed 32735 31234
>>> Kswapd efficiency 98% 97%
>>> Kswapd velocity 1243.877 1228.666
>>> Direct efficiency 99% 99%
>>> Direct velocity 18.767 17.671
>>> Percentage direct scans 1% 1%
>>> Zone normal velocity 299.981 291.409
>>> Zone dma32 velocity 962.522 954.928
>>> Zone dma velocity 0.142 0.000
>>> Page writes by reclaim 6278.800 5598.600
>>> Page writes file 0 0
>>> Page writes anon 6278 5598
>>> Page reclaim immediate 93 96
>>> Sector Reads 4357114 4307161
>>> Sector Writes 11053628 11053091
>>> Page rescued immediate 0 0
>>> Slabs scanned 1592829 1555770
>>> Direct inode steals 1557 2025
>>> Kswapd inode steals 46056 45418
>>> Kswapd skipped wait 0 0
>>> THP fault alloc 579 614
>>> THP collapse alloc 304 324
>>> THP splits 0 0
>>> THP fault fallback 793 730
>>> THP collapse fail 11 14
>>> Compaction stalls 1013 959
>>> Compaction success 92 69
>>> Compaction failures 920 890
>>> Page migrate success 238457 662054
>>> Page migrate failure 23021 32846
>>> Compaction pages isolated 504695 1370326
>>> Compaction migrate scanned 661390 7025772
>>> Compaction free scanned 13476658 73302642
>>> Compaction cost 262 762
>>>
>>> After this patch we see improvements in allocation success rate (especially for
>>> phase 3) along with increased compaction activity. The compaction stalls
>>> (direct compaction) in the interfering kernel builds (probably THP's) also
>>> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
>>> bit.
>>
>>
>> Why you did the test with THP? THP interferes result of main test so
>> it would be better not to enable it.
>
>
> Hmm I've always left it enabled. It makes for a more realistic interference
> and would also show unintended regressions in that closely related area.

But it makes review hard, because complex analysis is needed to
understand the result.

The following is an example.

"The compaction stalls
(direct compaction) in the interfering kernel builds (probably THP's) also
decreased somewhat to kcompactd activity, yet THP alloc successes improved a
bit."

So why do we need this comment to understand the effect of this patch? If you
had run the test without THP, it would not be necessary.

>> And, this patch increased compaction activity (10 times for migrate scanned)
>> may be due to resetting skip block information.
>
>
> Note that kswapd compaction activity was completely non-existent for reasons
> outlined in the changelog.
>> Isn't is better to disable it
>> for this patch to work as similar as possible that kswapd does and re-enable it
>> on next patch? If something goes bad, it can simply be reverted.
>>
>> Look like it is even not mentioned in the description.
>
>
> Yeah skip block information is discussed in the next patch, which mentions
> that it's being reset and why. I think it makes more sense, as when kswapd

Yes, I know.
What I'd like to say here is that you need to take care of current_is_kswapd()
in this patch. This patch unintentionally changes the background compaction
thread's behaviour to restart compaction every 64 trials, because
current_is_kswapd() called by kcompactd returns false, so kcompactd is treated
as direct reclaim. The results of patch 4 and patch 5 would be the same.

Thanks.
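
To illustrate the point about current_is_kswapd(), here is a toy model of the restart check. It assumes the check in mm/compaction.c of this era has roughly this shape: skip bits are cleared when compaction restarts after maximum deferral, unless the caller is kswapd. The struct, the function names and TOY_MAX_DEFER_SHIFT are hypothetical, and the value 6 is chosen only to match the "64 trials" above; please verify against the actual source.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEFER_SHIFT 6	/* 1 << 6 == 64 deferred attempts */

/* Hypothetical per-zone deferral counters. */
struct toy_defer_state {
	unsigned int compact_considered;	/* attempts since last deferral */
	unsigned int compact_defer_shift;	/* current back-off exponent */
};

/* True once the maximum back-off has been fully waited out. */
static bool toy_compaction_restarting(const struct toy_defer_state *s)
{
	assert(s->compact_defer_shift <= TOY_MAX_DEFER_SHIFT);
	return s->compact_defer_shift == TOY_MAX_DEFER_SHIFT &&
	       s->compact_considered >= (1U << s->compact_defer_shift);
}

/* Skip bits are reset on restart unless the caller identifies as kswapd.
 * kcompactd is a different thread, so it takes the "not kswapd" path. */
static bool toy_resets_skip_bits(const struct toy_defer_state *s,
				 bool caller_is_kswapd)
{
	return toy_compaction_restarting(s) && !caller_is_kswapd;
}
```

Under this model, a caller that is not kswapd resets the skip bits once 64 attempts have been considered at the maximum defer shift, while kswapd itself never did.
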

> reclaims from low watermark to high, potentially many pageblocks have new
> free pages and the skip bits are obsolete. Next, kcompactd is separate
> thread, so it doesn't stall allocations (or kswapd reclaim) by its activity.
> Personally I hope that one day we can get rid of the skip bits completely.
> They can make the stats look apparently nicer, but I think their effect is
> nearly random.
>
>>> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>>> */
>>> static bool kswapd_shrink_zone(struct zone *zone,
>>> int classzone_idx,
>>> - struct scan_control *sc,
>>> - unsigned long *nr_attempted)
>>> + struct scan_control *sc)
>>> {
>>> int testorder = sc->order;
>>
>>
>> You can remove testorder completely.
>
>
> Hm right, thanks.
>
>>> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>> - int *classzone_idx)
>>> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>> {
>>> int i;
>>> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
>>> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t
>>> *pgdat, int order,
>>> count_vm_event(PAGEOUTRUN);
>>>
>>> do {
>>> - unsigned long nr_attempted = 0;
>>> bool raise_priority = true;
>>> - bool pgdat_needs_compaction = (order > 0);
>>>
>>> sc.nr_reclaimed = 0;
>>>
>>> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t
>>> *pgdat, int order,
>>> break;
>>> }
>>>
>>> - if (!zone_balanced(zone, order, 0, 0)) {
>>> + if (!zone_balanced(zone, order, true, 0, 0)) {
>>
>>
>> Should we use highorder = true? We eventually skip to reclaim in the
>> kswapd_shrink_zone() when zone_balanced(,,false,,) is true.
>
>
> Hmm right. I probably thought that the value of end_zone ->
> balanced_classzone_idx would be important when waking kcompactd, but it's
> not used, so it's causing just some wasted CPU cycles.
>
> Thanks for the reviews!
>
>

2016-03-02 14:09:38

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
> 2016-03-02 19:04 GMT+09:00 Vlastimil Babka <[email protected]>:
>> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>
>>>
>>> Why you did the test with THP? THP interferes result of main test so
>>> it would be better not to enable it.
>>
>>
>> Hmm I've always left it enabled. It makes for a more realistic interference
>> and would also show unintended regressions in that closely related area.
>
> But, it makes review hard because complex analysis is needed to
> understand the result.
>
> Following is the example.
>
> "The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
> bit."
>
> So, why do we need this comment to understand effect of this patch? If you did
> a test without THP, it would not be necessary.

I see. Next time I'll do a run with THP disabled.

>>> And, this patch increased compaction activity (10 times for migrate
>>> scanned)
>>> may be due to resetting skip block information.
>>
>>
>> Note that kswapd compaction activity was completely non-existent for reasons
>> outlined in the changelog.
>>> Isn't is better to disable it
>>> for this patch to work as similar as possible that kswapd does and
>>> re-enable it
>>> on next patch? If something goes bad, it can simply be reverted.
>>>
>>> Look like it is even not mentioned in the description.
>>
>>
>> Yeah skip block information is discussed in the next patch, which mentions
>> that it's being reset and why. I think it makes more sense, as when kswapd
>
> Yes, I know.
> What I'd like to say here is that you need to care current_is_kswapd() in
> this patch. This patch unintentionally change the back ground compaction thread
> behaviour to restart compaction by every 64 trials because calling
> curret_is_kswapd()
> by kcompactd would return false and is treated as direct reclaim.

Oh, you mean this path to reset the skip bits. I see. But if skip bits
are already reset by kswapd when waking kcompactd, then effect of
another (rare) reset in kcompactd itself will be minimal?

> Result of patch 4
> and patch 5 would be same.

It's certainly possible to fold patch 5 into 4. I posted them separately
mainly to make review more feasible. But the differences in results are
already quite small.

2016-03-02 14:22:14

by Joonsoo Kim

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

2016-03-02 23:09 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>
>> 2016-03-02 19:04 GMT+09:00 Vlastimil Babka <[email protected]>:
>>>
>>> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>>
>>>>
>>>>
>>>> Why you did the test with THP? THP interferes result of main test so
>>>> it would be better not to enable it.
>>>
>>>
>>>
>>> Hmm I've always left it enabled. It makes for a more realistic
>>> interference
>>> and would also show unintended regressions in that closely related area.
>>
>>
>> But, it makes review hard because complex analysis is needed to
>> understand the result.
>>
>> Following is the example.
>>
>> "The compaction stalls
>> (direct compaction) in the interfering kernel builds (probably THP's) also
>> decreased somewhat to kcompactd activity, yet THP alloc successes improved
>> a
>> bit."
>>
>> So, why do we need this comment to understand effect of this patch? If you
>> did
>> a test without THP, it would not be necessary.
>
>
> I see. Next time I'll do a run with THP disabled.
>
>>>> And, this patch increased compaction activity (10 times for migrate
>>>> scanned)
>>>> may be due to resetting skip block information.
>>>
>>>
>>>
>>> Note that kswapd compaction activity was completely non-existent for
>>> reasons
>>> outlined in the changelog.
>>>>
>>>> Isn't is better to disable it
>>>> for this patch to work as similar as possible that kswapd does and
>>>> re-enable it
>>>> on next patch? If something goes bad, it can simply be reverted.
>>>>
>>>> Look like it is even not mentioned in the description.
>>>
>>>
>>>
>>> Yeah skip block information is discussed in the next patch, which
>>> mentions
>>> that it's being reset and why. I think it makes more sense, as when
>>> kswapd
>>
>>
>> Yes, I know.
>> What I'd like to say here is that you need to care current_is_kswapd() in
>> this patch. This patch unintentionally change the back ground compaction
>> thread
>> behaviour to restart compaction by every 64 trials because calling
>> curret_is_kswapd()
>
>> by kcompactd would return false and is treated as direct reclaim.
>
> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
> already reset by kswapd when waking kcompactd, then effect of another (rare)
> reset in kcompactd itself will be minimal?

If you handle current_is_kswapd() properly in this patch (properly means changing
it to something like "current_is_kcompactd()"), the reset in kswapd would not
happen, because compact_blockskip_flush would not be set by kcompactd.

In this case, patch 5 would have its own meaning, so it cannot be folded.

Thanks.

>> Result of patch 4
>> and patch 5 would be same.
>
>
> It's certainly possible to fold patch 5 into 4. I posted them separately
> mainly to make review more feasible. But the differences in results are
> already quite small.
>

2016-03-02 14:41:00

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
> 2016-03-02 23:09 GMT+09:00 Vlastimil Babka <[email protected]>:
>> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>>
>>>
>>> Yes, I know.
>>> What I'd like to say here is that you need to care current_is_kswapd() in
>>> this patch. This patch unintentionally change the back ground compaction
>>> thread
>>> behaviour to restart compaction by every 64 trials because calling
>>> curret_is_kswapd()
>>
>>> by kcompactd would return false and is treated as direct reclaim.
>>
>> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
>> already reset by kswapd when waking kcompactd, then effect of another (rare)
>> reset in kcompactd itself will be minimal?
>
> If you care current_is_kswapd() in this patch properly (properly means change
> like "current_is_kcompactd()), reset in kswapd would not
> happen because, compact_blockskip_flush would not be set by kcompactd.
>
> In this case, patch 5 would have it's own meaning so cannot be folded.

So I understand that patch 5 would be just about this?

- if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
+ if (compaction_restarting(zone, cc->order))
__reset_isolation_suitable(zone);

I'm more inclined to fold it in that case.

> Thanks.
>
>>> Result of patch 4
>>> and patch 5 would be same.
>>
>>
>> It's certainly possible to fold patch 5 into 4. I posted them separately
>> mainly to make review more feasible. But the differences in results are
>> already quite small.
>>

2016-03-02 14:59:11

by Joonsoo Kim

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

2016-03-02 23:40 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>> 2016-03-02 23:09 GMT+09:00 Vlastimil Babka <[email protected]>:
>>> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>>>
>>>>
>>>> Yes, I know.
>>>> What I'd like to say here is that you need to care current_is_kswapd() in
>>>> this patch. This patch unintentionally change the back ground compaction
>>>> thread
>>>> behaviour to restart compaction by every 64 trials because calling
>>>> curret_is_kswapd()
>>>
>>>> by kcompactd would return false and is treated as direct reclaim.
>>>
>>> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
>>> already reset by kswapd when waking kcompactd, then effect of another (rare)
>>> reset in kcompactd itself will be minimal?
>>
>> If you care current_is_kswapd() in this patch properly (properly means change
>> like "current_is_kcompactd()), reset in kswapd would not
>> happen because, compact_blockskip_flush would not be set by kcompactd.
>>
>> In this case, patch 5 would have it's own meaning so cannot be folded.
>
> So I understand that patch 5 would be just about this?
>
> - if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
> + if (compaction_restarting(zone, cc->order))
> __reset_isolation_suitable(zone);

Yeah, you understand correctly. :)

> I'm more inclined to fold it in that case.

Patch would be just simple, but, I guess it would cause some difference
in test result. But, I'm okay for folding.

Thanks.

2016-03-02 15:22:47

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
> 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <[email protected]>:
>> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>>
>> So I understand that patch 5 would be just about this?
>>
>> - if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
>> + if (compaction_restarting(zone, cc->order))
>> __reset_isolation_suitable(zone);
>
> Yeah, you understand correctly. :)
>
>> I'm more inclined to fold it in that case.
>
> Patch would be just simple, but, I guess it would cause some difference
> in test result. But, I'm okay for folding.

Thanks. Andrew, should I send now patch folding patch 4/5 and 5/5 with
all the accumulated fixlets (including those I sent earlier today) and
combined changelog, or do you want to apply the new fixlets separately
first and let them sit for a week or so? In any case, sorry for the churn.

> Thanks.
>

2016-03-04 23:25:10

by Andrew Morton

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On Wed, 2 Mar 2016 16:22:43 +0100 Vlastimil Babka <[email protected]> wrote:

> On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
> > 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <[email protected]>:
> >> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
> >>
> >> So I understand that patch 5 would be just about this?
> >>
> >> - if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
> >> + if (compaction_restarting(zone, cc->order))
> >> __reset_isolation_suitable(zone);
> >
> > Yeah, you understand correctly. :)
> >
> >> I'm more inclined to fold it in that case.
> >
> > Patch would be just simple, but, I guess it would cause some difference
> > in test result. But, I'm okay for folding.
>
> Thanks. Andrew, should I send now patch folding patch 4/5 and 5/5 with
> all the accumulated fixlets (including those I sent earlier today) and
> combined changelog, or do you want to apply the new fixlets separately
> first and let them sit for a week or so? In any case, sorry for the churn.

Did I get everything?

http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

2016-03-07 09:45:46

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

2016-03-09 13:47:37

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/05/2016 12:25 AM, Andrew Morton wrote:
>> Thanks. Andrew, should I send now patch folding patch 4/5 and 5/5 with
>> all the accumulated fixlets (including those I sent earlier today) and
>> combined changelog, or do you want to apply the new fixlets separately
>> first and let them sit for a week or so? In any case, sorry for the churn.
>
> Did I get everything?
>
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch

Please add the one below here.

> http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

----8<----
From 0977a031f891ef6f675a64c53b797d92d839f11c Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Thu, 3 Mar 2016 11:35:23 +0100
Subject: [PATCH 1/3] mm-compaction-introduce-kcompactd-fix-4

Fix typo in /proc/vmstat for kcompactd wakeups. Per Hugh's suggestion,
rename the item to compact_daemon_wake.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/vmstat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c9571294f61c..f80066248c94 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -826,7 +826,7 @@ const char * const vmstat_text[] = {
"compact_stall",
"compact_fail",
"compact_success",
- "compact_kcompatd_wake",
+ "compact_daemon_wake",
#endif

#ifdef CONFIG_HUGETLB_PAGE
--
2.7.2

2016-03-09 13:52:15

by Vlastimil Babka

Subject: Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd

On 03/05/2016 12:25 AM, Andrew Morton wrote:
> On Wed, 2 Mar 2016 16:22:43 +0100 Vlastimil Babka <[email protected]> wrote:
>
>> On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
>>> 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <[email protected]>:
>>>> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>>>>
>>>> So I understand that patch 5 would be just about this?
>>>>
>>>> - if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
>>>> + if (compaction_restarting(zone, cc->order))
>>>> __reset_isolation_suitable(zone);
>>>
>>> Yeah, you understand correctly. :)
>>>
>>>> I'm more inclined to fold it in that case.
>>>
>>> Patch would be just simple, but, I guess it would cause some difference
>>> in test result. But, I'm okay for folding.
>>
>> Thanks. Andrew, should I send now patch folding patch 4/5 and 5/5 with
>> all the accumulated fixlets (including those I sent earlier today) and
>> combined changelog, or do you want to apply the new fixlets separately
>> first and let them sit for a week or so? In any case, sorry for the churn.
>
> Did I get everything?
>
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch

Please replace the following three:

> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

With the squashed one below (had to mangle the changelog nontrivially).
This is after discussion with Joonsoo. It was perhaps better separately for
review, but functionality-wise the first patch leaves things somewhat
weird without the third patch.

----8<----
From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Wed, 9 Mar 2016 12:45:24 +0100
Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up
kcompactd

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim
and compaction to attempt making memory allocation of given order
available. The details differ from direct reclaim e.g. in having high
watermark as a goal. The code involved in kswapd's reclaim/compaction
decisions has evolved to be quite complex. Testing reveals that it
doesn't actually work in at least one scenario, and closer inspection
suggests that it could be greatly simplified without compromising on the
goal (make high-order page available) or efficiency (don't reclaim too
much). The simplification relies on doing all compaction in kcompactd,
which is simply woken up when high watermarks are reached by kswapd's
reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd. There was no compaction attempt
from kswapd during the whole test. Some added instrumentation shows what
happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
merely checks if high watermarks were reached for base pages. This is true,
so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
compaction happens due to the condition sc.nr_reclaimed > nr_attempted
being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
pgdat_balanced() is false as only the small zone DMA appears balanced
(curiously in that check, watermark appears OK and compaction_suitable()
returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA
zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false. The condition really should use >= as the
comment suggests. Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety. Hopefully this
demonstrates that this is unsustainable.
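
The comparison described above can be tried in isolation. This is only a
minimal sketch, not the kernel code: should_compact_strict() and
should_compact_inclusive() are hypothetical helper names, and only the two
counters mentioned in the changelog are modelled.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the old condition in balance_pgdat():
 * compaction is considered only if sc.nr_reclaimed > nr_attempted.
 */
static bool should_compact_strict(unsigned long nr_reclaimed,
				  unsigned long nr_attempted)
{
	return nr_reclaimed > nr_attempted;
}

/* The same check using >=, which the changelog says the comment implied. */
static bool should_compact_inclusive(unsigned long nr_reclaimed,
				     unsigned long nr_attempted)
{
	return nr_reclaimed >= nr_attempted;
}
```

With the values from the instrumented run (0 reclaimed, 99 attempted) both
variants return false. With 0 reclaimed and 0 attempted, only the inclusive
variant returns true, which is the case the paragraph above refers to.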

Luckily we can simplify this a lot. The reclaim/compaction decisions make
sense for direct reclaim scenario, but in kswapd, our primary goal is to
reach high watermark in order-0 pages. Afterwards we can attempt
compaction just once. Unlike direct reclaim, we don't reclaim extra pages
(over the high watermark), the current code already disallows it for good
reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after
we have either succeeded or failed to reach the high watermarks in kswapd,
which goes to sleep. We pass kswapd's order and classzone_idx, so
kcompactd can apply the same criteria to determine which zones are worth
compacting. Note that we use the classzone_idx from wakeup_kswapd(), not
balanced_classzone_idx which can include higher zones that kswapd tried to
balance too, but didn't consider them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust
how it determines the zones to be balanced. The key element here is
adding a "highorder" parameter to zone_balanced, which, when set to false,
makes it consider only order-0 watermark instead of the desired higher
order (this was done previously by kswapd_shrink_zone(), but not
elsewhere). This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.
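
The effect of the "highorder" parameter can be illustrated with a toy model.
This is a sketch under stated assumptions, not the kernel implementation:
struct toy_zone, toy_watermark_ok() and toy_zone_balanced() are hypothetical
names, the watermark check is heavily simplified, and CONFIG_COMPACTION is
assumed to be enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11

/* Toy zone: only the fields this sketch needs (not struct zone). */
struct toy_zone {
	unsigned long free_pages;		/* total free base pages */
	unsigned long high_wmark;		/* high watermark, in pages */
	unsigned long nr_free[TOY_MAX_ORDER];	/* free blocks per order */
};

/*
 * Simplified watermark check: enough free pages overall and, for
 * order > 0, at least one free block of that order or larger.
 */
static bool toy_watermark_ok(const struct toy_zone *z, int order,
			     unsigned long mark)
{
	assert(order >= 0 && order < TOY_MAX_ORDER);

	if (z->free_pages <= mark)
		return false;
	if (order == 0)
		return true;
	for (int o = order; o < TOY_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/*
 * Same shape as the patched zone_balanced(): with highorder == false the
 * check falls back to order-0, with the mark raised by one block of the
 * requested order.
 */
static bool toy_zone_balanced(const struct toy_zone *z, int order,
			      bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	if (!highorder) {
		mark += 1UL << order;
		order = 0;
	}
	return toy_watermark_ok(z, order, mark);
}
```

In this model, a zone with plenty of free base pages but no free order-9
block is "balanced" when called with highorder=false (the pgdat_balanced()
view) and not balanced with highorder=true (the wakeup_kswapd() view), so
kswapd can go to sleep while a wakeup for the high-order request still
happens.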

The last thing is to decide what to do with pageblock_skip bitmap handling.
Compaction maintains a pageblock_skip bitmap to record pageblocks where
isolation recently failed. This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
finished scanning the whole zone and set zone->compact_blockskip_flush.
Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it. The check for direct compaction in 1), and
to set the flush flag in 2) use current_is_kswapd(), which doesn't work
for kcompactd. Thus, this patch adds bool direct_compaction to
compact_control to use in 2). For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.
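
The resulting rules for cases 1) and 2) can be modelled roughly as follows.
This is a hedged sketch only: all toy_* names are hypothetical, the bitmap is
reduced to a single boolean, and clearing the flush flag as part of the reset
is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_zone_state {
	bool compact_blockskip_flush;	/* reset requested for kswapd sleep */
	bool skip_bits_valid;		/* stand-in for the pageblock_skip bitmap */
};

struct toy_compact_control {
	bool direct_compaction;		/* false from kcompactd or /proc */
};

/* Model of resetting the skip information (assumed to clear the flag too). */
static void toy_reset_skip_bits(struct toy_zone_state *z)
{
	z->skip_bits_valid = false;
	z->compact_blockskip_flush = false;
}

/* Case 2, first half: only direct compaction requests a flush after
 * scanning the whole zone. */
static void toy_finish_full_scan(struct toy_zone_state *z,
				 const struct toy_compact_control *cc)
{
	assert(z && cc);
	if (cc->direct_compaction)
		z->compact_blockskip_flush = true;
}

/* Case 2, second half: kswapd going to sleep honours a pending request. */
static void toy_kswapd_sleep(struct toy_zone_state *z)
{
	assert(z);
	if (z->compact_blockskip_flush)
		toy_reset_skip_bits(z);
}

/* Case 1: reset when restarting after the deferred cycle, for any caller. */
static void toy_compact_start(struct toy_zone_state *z, bool restarting)
{
	assert(z);
	if (restarting)
		toy_reset_skip_bits(z);
}
```

In this model a full scan by kcompactd never requests a flush, while a
restart after deferral resets the skip information regardless of who is
compacting.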

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see
the flushed pageblock_skip bits. This is different from when the former
kswapd compaction observed the bits and I believe it makes more sense.
Kcompactd can afford to be more thorough than a direct compaction trying
to limit allocation latency, or kswapd whose primary goal is to reclaim.

For testing, I used stress-highalloc configured to do order-9 allocations
with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on
kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)

User 3166.67 3181.09
System 1153.37 1158.25
Elapsed 1768.53 1799.37

4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Direct pages scanned 32938 32797
Kswapd pages scanned 2183166 2202613
Kswapd pages reclaimed 2152359 2143524
Direct pages reclaimed 32735 32545
Percentage direct scans 1% 1%
THP fault alloc 579 612
THP collapse alloc 304 316
THP splits 0 0
THP fault fallback 793 778
THP collapse fail 11 16
Compaction stalls 1013 1007
Compaction success 92 67
Compaction failures 920 939
Page migrate success 238457 721374
Page migrate failure 23021 23469
Compaction pages isolated 504695 1479924
Compaction migrate scanned 661390 8812554
Compaction free scanned 13476658 84327916
Compaction cost 262 838

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity. The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity, yet
THP alloc successes improved a bit.

Note that elapsed and user time isn't so useful for this benchmark,
because of the background interference being unpredictable. It's just to
quickly spot some major unexpected differences. System time is somewhat
more useful and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake 2547781 2269241
Time kcompactd awake 0 119253
Time direct compacting 939937 557649
Time kswapd compacting 0 0
Time kcompactd compacting 0 119099

The decrease of overall time spent compacting appears to not match the
increased compaction stats. I suspect the tasks get rescheduled and since
the ftrace monitor doesn't see that, the reported time is wall time, not
CPU time. But arguably direct compactors care about overall latency
anyway, whether busy compacting or waiting for CPU doesn't matter. And
that latency seems to have almost halved.

It's also interesting how much time kswapd spent awake just going through
all the priorities and failing to even try compacting, over and over.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
4.5-rc1+before 4.5-rc1+after
-direct -direct
Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)

User 3344.73 3246.04
System 1194.24 1172.29
Elapsed 1838.04 1836.76

4.5-rc1+before 4.5-rc1+after
-direct -direct
Direct pages scanned 125146 120966
Kswapd pages scanned 2119757 2135012
Kswapd pages reclaimed 2073183 2108388
Direct pages reclaimed 124909 120577
Percentage direct scans 5% 5%
THP fault alloc 599 652
THP collapse alloc 323 354
THP splits 0 0
THP fault fallback 806 793
THP collapse fail 17 16
Compaction stalls 2457 2025
Compaction success 906 518
Compaction failures 1551 1507
Page migrate success 2031423 2360608
Page migrate failure 32845 40852
Compaction pages isolated 4129761 4802025
Compaction migrate scanned 11996712 21750613
Compaction free scanned 214970969 344372001
Compaction cost 2271 2694

In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can. There's however significant
reduction in direct compaction stalls (that is, the number of allocations
that went into direct compaction). The number of successes (i.e. direct
compaction stalls that ended up with successful allocation) is reduced by
the same number. This means the offload to kcompactd is working as
expected, and direct compaction is reduced either due to detecting
contention, or compaction deferred by kcompactd. In the previous version
of this patchset there was some apparent reduction of success rate, but
the changes in this version (such as using sync compaction only), new
baseline kernel, and/or averaging results from 5 executions (my bet), made
this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake 2532984 2326824
Time kcompactd awake 0 257916
Time direct compacting 864839 735130
Time kswapd compacting 0 0
Time kcompactd compacting 0 257585

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/compaction.c | 10 ++--
mm/internal.h | 1 +
mm/vmscan.c | 147 ++++++++++++++++++--------------------------------------
3 files changed, 54 insertions(+), 104 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5b2bfbaa821a..ccf97b02b85f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,

/*
* Mark that the PG_migrate_skip information should be cleared
- * by kswapd when it goes to sleep. kswapd does not set the
+ * by kswapd when it goes to sleep. kcompactd does not set the
* flag itself as the decision to be clear should be directly
* based on an allocation request.
*/
- if (!current_is_kswapd())
+ if (cc->direct_compaction)
zone->compact_blockskip_flush = true;

return COMPACT_COMPLETE;
@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

/*
* Clear pageblock skip if there were failures recently and compaction
- * is about to be retried after being deferred. kswapd does not do
- * this reset as it'll reset the cached information when going to sleep.
+ * is about to be retried after being deferred.
*/
- if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+ if (compaction_restarting(zone, cc->order))
__reset_isolation_suitable(zone);

/*
@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.mode = mode,
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
+ .direct_compaction = true,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
diff --git a/mm/internal.h b/mm/internal.h
index 17ae0b52534b..013a786fa37f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -181,6 +181,7 @@ struct compact_control {
unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
+ bool direct_compaction; /* False from kcompactd or /proc/... */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c67df4831565..23bc7e643ad8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
} while (memcg);
}

-static bool zone_balanced(struct zone *zone, int order,
- unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, bool highorder,
+ unsigned long balance_gap, int classzone_idx)
{
- if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
- balance_gap, classzone_idx))
- return false;
+ unsigned long mark = high_wmark_pages(zone) + balance_gap;

- if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
- order, 0, classzone_idx) == COMPACT_SKIPPED)
- return false;
+ /*
+ * When checking from pgdat_balanced(), kswapd should stop and sleep
+ * when it reaches the high order-0 watermark and let kcompactd take
+ * over. Other callers such as wakeup_kswapd() want to determine the
+ * true high-order watermark.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
+ mark += (1UL << order);
+ order = 0;
+ }

- return true;
+ return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
}

/*
@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
continue;
}

- if (zone_balanced(zone, order, 0, i))
+ if (zone_balanced(zone, order, false, 0, i))
balanced_pages += zone->managed_pages;
else if (!order)
return false;
@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
*/
static bool kswapd_shrink_zone(struct zone *zone,
int classzone_idx,
- struct scan_control *sc,
- unsigned long *nr_attempted)
+ struct scan_control *sc)
{
- int testorder = sc->order;
unsigned long balance_gap;
bool lowmem_pressure;

@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));

/*
- * Kswapd reclaims only single pages with compaction enabled. Trying
- * too hard to reclaim until contiguous free pages have become
- * available can hurt performance by evicting too much useful data
- * from memory. Do not reclaim more than needed for compaction.
- */
- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
- compaction_suitable(zone, sc->order, 0, classzone_idx)
- != COMPACT_SKIPPED)
- testorder = 0;
-
- /*
* We put equal pressure on every zone, unless one zone has way too
* many pages free already. The "too many pages" is defined as the
* high wmark plus a "gap" where the gap is either the low
@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
* reclaim is necessary
*/
lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
- if (!lowmem_pressure && zone_balanced(zone, testorder,
+ if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
balance_gap, classzone_idx))
return true;

shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

- /* Account for the number of pages attempted to reclaim */
- *nr_attempted += sc->nr_to_reclaim;
-
clear_bit(ZONE_WRITEBACK, &zone->flags);

/*
@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
* waits.
*/
if (zone_reclaimable(zone) &&
- zone_balanced(zone, testorder, 0, classzone_idx)) {
+ zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
clear_bit(ZONE_CONGESTED, &zone->flags);
clear_bit(ZONE_DIRTY, &zone->flags);
}
@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
- * Returns the final order kswapd was reclaiming at
+ * Returns the highest zone idx kswapd was reclaiming at
*
* There is special handling here for zones which are full of pinned pages.
* This can happen if the pages are all mlocked, or if they are all used by
@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
- int *classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
{
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
count_vm_event(PAGEOUTRUN);

do {
- unsigned long nr_attempted = 0;
bool raise_priority = true;
- bool pgdat_needs_compaction = (order > 0);

sc.nr_reclaimed = 0;

@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
break;
}

- if (!zone_balanced(zone, order, 0, 0)) {
+ if (!zone_balanced(zone, order, false, 0, 0)) {
end_zone = i;
break;
} else {
@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (i < 0)
goto out;

- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- /*
- * If any zone is currently balanced then kswapd will
- * not call compaction as it is expected that the
- * necessary pages are already available.
- */
- if (pgdat_needs_compaction &&
- zone_watermark_ok(zone, order,
- low_wmark_pages(zone),
- *classzone_idx, 0))
- pgdat_needs_compaction = false;
- }
-
/*
* If we're getting trouble reclaiming, start doing writepage
* even in laptop mode.
@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* that that high watermark would be met at 100%
* efficiency.
*/
- if (kswapd_shrink_zone(zone, end_zone,
- &sc, &nr_attempted))
+ if (kswapd_shrink_zone(zone, end_zone, &sc))
raise_priority = false;
}

@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
pfmemalloc_watermark_ok(pgdat))
wake_up_all(&pgdat->pfmemalloc_wait);

- /*
- * Fragmentation may mean that the system cannot be rebalanced
- * for high-order allocations in all zones. If twice the
- * allocation size has been reclaimed and the zones are still
- * not balanced then recheck the watermarks at order-0 to
- * prevent kswapd reclaiming excessively. Assume that a
- * process requested a high-order can direct reclaim/compact.
- */
- if (order && sc.nr_reclaimed >= 2UL << order)
- order = sc.order = 0;
-
/* Check if kswapd should be suspending */
if (try_to_freeze() || kthread_should_stop())
break;

/*
- * Compact if necessary and kswapd is reclaiming at least the
- * high watermark number of pages as requsted
- */
- if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
- compact_pgdat(pgdat, order);
-
- /*
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
if (raise_priority || !sc.nr_reclaimed)
sc.priority--;
} while (sc.priority >= 1 &&
- !pgdat_balanced(pgdat, order, *classzone_idx));
+ !pgdat_balanced(pgdat, order, classzone_idx));

out:
/*
- * Return the order we were reclaiming at so prepare_kswapd_sleep()
- * makes a decision on the order we were last reclaiming at. However,
- * if another caller entered the allocator slow path while kswapd
- * was awake, order will remain at the higher level
+ * Return the highest zone idx we were reclaiming at so
+ * prepare_kswapd_sleep() makes the same decisions as here.
*/
- *classzone_idx = end_zone;
- return order;
+ return end_zone;
}

-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
+ int classzone_idx, int balanced_classzone_idx)
{
long remaining = 0;
DEFINE_WAIT(wait);
@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

/* Try to sleep for a short interval */
- if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining,
+ balanced_classzone_idx)) {
remaining = schedule_timeout(HZ/10);
finish_wait(&pgdat->kswapd_wait, &wait);
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* After a short sleep, check if it was a premature sleep. If not, then
* go fully to sleep until explicitly woken up.
*/
- if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining,
+ balanced_classzone_idx)) {
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*
@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
*/
reset_isolation_suitable(pgdat);

+ /*
+ * We have freed the memory, now we should compact it to make
+ * allocation of the requested order possible.
+ */
+ wakeup_kcompactd(pgdat, order, classzone_idx);
+
if (!kthread_should_stop())
schedule();

@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
static int kswapd(void *p)
{
unsigned long order, new_order;
- unsigned balanced_order;
int classzone_idx, new_classzone_idx;
int balanced_classzone_idx;
pg_data_t *pgdat = (pg_data_t*)p;
@@ -3440,23 +3394,19 @@ static int kswapd(void *p)
set_freezable();

order = new_order = 0;
- balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
balanced_classzone_idx = classzone_idx;
for ( ; ; ) {
bool ret;

/*
- * If the last balance_pgdat was unsuccessful it's unlikely a
- * new request of a similar or harder type will succeed soon
- * so consider going to sleep on the basis we reclaimed at
+ * While we were reclaiming, there might have been another
+ * wakeup, so check the values.
*/
- if (balanced_order == new_order) {
- new_order = pgdat->kswapd_max_order;
- new_classzone_idx = pgdat->classzone_idx;
- pgdat->kswapd_max_order = 0;
- pgdat->classzone_idx = pgdat->nr_zones - 1;
- }
+ new_order = pgdat->kswapd_max_order;
+ new_classzone_idx = pgdat->classzone_idx;
+ pgdat->kswapd_max_order = 0;
+ pgdat->classzone_idx = pgdat->nr_zones - 1;

if (order < new_order || classzone_idx > new_classzone_idx) {
/*
@@ -3466,7 +3416,7 @@ static int kswapd(void *p)
order = new_order;
classzone_idx = new_classzone_idx;
} else {
- kswapd_try_to_sleep(pgdat, balanced_order,
+ kswapd_try_to_sleep(pgdat, order, classzone_idx,
balanced_classzone_idx);
order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;
@@ -3486,9 +3436,8 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- balanced_classzone_idx = classzone_idx;
- balanced_order = balance_pgdat(pgdat, order,
- &balanced_classzone_idx);
+ balanced_classzone_idx = balance_pgdat(pgdat, order,
+ classzone_idx);
}
}

@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_balanced(zone, order, 0, 0))
+ if (zone_balanced(zone, order, true, 0, 0))
return;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
--
2.7.2
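
For readers following the zone_balanced() hunk above, here is a minimal userspace sketch of how the new highorder flag changes the check. It is not kernel code. struct toy_zone, toy_watermark_ok() and toy_zone_balanced() are hypothetical names made up for this illustration, the compaction_enabled parameter stands in for IS_ENABLED(CONFIG_COMPACTION), and the watermark test is a heavily simplified model that ignores lowmem reserves, alloc flags and per-migratetype free lists.

```c
#include <stdbool.h>

#define TOY_MAX_ORDER 11

/* Hypothetical stand-in for struct zone; only the fields this sketch needs. */
struct toy_zone {
	unsigned long free_pages;                   /* free base pages in the zone */
	unsigned long high_wmark;                   /* high watermark, in base pages */
	unsigned long free_at_order[TOY_MAX_ORDER]; /* free blocks per order */
};

/*
 * Simplified model of a watermark check (an assumption, not the kernel's
 * implementation): enough free base pages overall, and for order > 0 at
 * least one free block of the requested order or larger.
 */
static bool toy_watermark_ok(const struct toy_zone *z, int order,
			     unsigned long mark)
{
	if (z->free_pages <= mark)
		return false;
	if (order == 0)
		return true;
	for (int o = order; o < TOY_MAX_ORDER; o++)
		if (z->free_at_order[o])
			return true;
	return false;
}

/*
 * Mirrors the shape of the patched zone_balanced(): without highorder,
 * the check becomes an order-0 check against a mark raised by
 * (1 << order), leaving the work of forming the block to compaction.
 */
static bool toy_zone_balanced(const struct toy_zone *z, int order,
			      bool highorder, unsigned long balance_gap,
			      bool compaction_enabled)
{
	unsigned long mark = z->high_wmark + balance_gap;

	if (compaction_enabled && !highorder) {
		mark += 1UL << order;
		order = 0;
	}
	return toy_watermark_ok(z, order, mark);
}
```

In this model, a fragmented zone with plenty of free base pages but no free order-9 block counts as balanced for the kswapd-style callers (highorder == false), so reclaim stops there, while a wakeup_kswapd()-style caller (highorder == true) still sees it as unbalanced.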