2020-02-25 14:16:04

by Mel Gorman

Subject: [PATCH 0/3] Limit runaway reclaim due to watermark boosting

Ivan Babrou reported the following

Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
an external fragmentation event occurs") introduced undesired
effects in our environment.

* NUMA with 2 x CPU
* 128GB of RAM
* THP disabled
* Upgraded from 4.19 to 5.4

Before we saw free memory hover at around 1.4GB with no
spikes. After the upgrade we saw some machines decide that they
need a lot more than that, with frequent spikes above 10GB,
often only on a single numa node.

There have been a few reports recently that might be watermark boost
related. Unfortunately, finding someone that can reproduce the problem
and test a patch has been problematic. This series intends to limit
potential damage only.

Patch 1 disables boosting on small memory systems.

Patch 2 disables boosting by default if THP is disabled on the kernel
command line on the basis that boosting primarily helps THP
allocation latency. It is not touched at runtime to avoid
overriding an explicit user request.

Patch 3 disables boosting if kswapd priority is elevated to avoid
excessive reclaim.

mm/huge_memory.c | 1 +
mm/internal.h | 6 +++++-
mm/page_alloc.c | 6 ++++--
mm/vmscan.c | 46 +++++++++++++++++++++++++++++++---------------
4 files changed, 41 insertions(+), 18 deletions(-)

--
2.16.4


2020-02-25 14:16:07

by Mel Gorman

Subject: [PATCH 1/3] mm, page_alloc: Disable boosted watermark based reclaim on low-memory systems

An off-list bug report indicated that watermark boosting was affecting a
very small memory system and causing excessive reclaim. Watermark boosting
is intended to reduce the mixing of page mobility types within pageblocks
for high-order allocations. If the system has so little memory that pages
are not even grouped by mobility, then watermark boosting should also
be disabled.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 6 +++++-
mm/page_alloc.c | 6 ++++--
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..439561cc90d9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -534,10 +534,14 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#endif
#define ALLOC_KSWAPD 0x200 /* allow waking of kswapd */

+static inline void disable_watermark_boosting(void)
+{
+ watermark_boost_factor = 0;
+}
+
enum ttu_flags;
struct tlbflush_unmap_batch;

-
/*
* only for MM internal work items which do not depend on
* any allocations or locks which might depend on allocations
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..c5c05b6dc459 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5810,10 +5810,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
* made on memory-hotadd so a system can start with mobility
* disabled and enable it later
*/
- if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
+ if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) {
page_group_by_mobility_disabled = 1;
- else
+ disable_watermark_boosting();
+ } else {
page_group_by_mobility_disabled = 0;
+ }

pr_info("Built %u zonelists, mobility grouping %s. Total pages: %ld\n",
nr_online_nodes,
--
2.16.4

2020-02-25 14:16:34

by Mel Gorman

Subject: [PATCH 3/3] mm, vmscan: Do not reclaim for boosted watermarks at high priority

Ivan Babrou reported the following (slightly paraphrased)

Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
an external fragmentation event occurs") introduced undesired
effects in our environment.

* NUMA with 2 x CPU
* 128GB of RAM
* THP disabled
* Upgraded from 4.19 to 5.4

Before we saw free memory hover at around 1.4GB with no
spikes. After the upgrade we saw some machines decide that they
need a lot more than that, with frequent spikes above 10GB,
often only on a single numa node.

We can see kswapd quite active in balance_pgdat (it didn't look
like it slept at all):

$ ps uax | fgrep kswapd
root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]

This in turn massively increased pressure on the page cache, which
did not go well for services that depend on a quick response from a
local cache backed by solid storage.

Rik van Riel indicated that he had observed something similar. Details
are sparse but the bulk of the excessive reclaim activity appears to
be on node 0. My belief is that on node 0, a DMA32 or DMA zone can get
boosted but vmscan then reclaims from higher zones until the boost is
removed. While we could apply the reclaim to just the lower zones, it
would result in a lot of pages skipped during scanning.

Watermark boosting is inherently optimistic and is only applied to
reduce the possibility of pageblocks being mixed further in the future,
so that high-order allocations are both more likely to succeed and to be
allocated with lower latency. It was never intended to reclaim excessively.

This patch limits watermark boosting. If reclaim reaches a higher
priority, then reclaim based on watermark boosting is aborted.
Unfortunately, the bug reporters are not in a position to actually test
this, but it makes sense that watermark boosting be aborted quickly when
reclaim is not making progress, given that boosting was never intended
to reclaim or scan excessively.

While I could not reproduce the problem locally, this was compared against
a vanilla kernel and one with watermark boosting disabled. The test results
still indicate that this helps the workloads addressed by 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation event
occurs"), although the behaviour of THP allocation has changed since, making
a direct comparison problematic. At worst, this patch will mitigate a
problem when watermarks are persistently boosted.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 46 +++++++++++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..40c9e48dc542 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3449,6 +3449,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
return false;
}

+static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
+ unsigned long *zone_boosts)
+{
+ struct zone *zone;
+ unsigned long flags;
+ int i;
+
+ for (i = 0; i <= classzone_idx; i++) {
+ if (!zone_boosts[i])
+ continue;
+
+ /* Increments are under the zone lock */
+ zone = pgdat->node_zones + i;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
/* Clear pgdat state for congested, dirty or under writeback. */
static void clear_pgdat_congested(pg_data_t *pgdat)
{
@@ -3641,9 +3660,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
if (!nr_boost_reclaim && balanced)
goto out;

- /* Limit the priority of boosting to avoid reclaim writeback */
- if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
- raise_priority = false;
+ /*
+ * Abort boosting if reclaiming at higher priority is not
+ * working to avoid excessive reclaim due to lower zones
+ * being boosted.
+ */
+ if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
+ acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
+ boosted = false;
+ nr_boost_reclaim = 0;
+ goto restart;
+ }

/*
* Do not writeback or swap pages for boosted reclaim. The
@@ -3725,18 +3752,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
out:
/* If reclaim was boosted, account for the reclaim done in this pass */
if (boosted) {
- unsigned long flags;
-
- for (i = 0; i <= classzone_idx; i++) {
- if (!zone_boosts[i])
- continue;
-
- /* Increments are under the zone lock */
- zone = pgdat->node_zones + i;
- spin_lock_irqsave(&zone->lock, flags);
- zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
- spin_unlock_irqrestore(&zone->lock, flags);
- }
+ acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);

/*
* As there is now likely space, wakeup kcompact to defragment
--
2.16.4

2020-02-25 14:17:21

by Mel Gorman

Subject: [PATCH 2/3] mm, page_alloc: Disable watermark boosting if THP is disabled at boot

Watermark boosting is intended to increase the success rate and reduce
latency of high-order allocations, particularly THP. If THP is disabled
at boot, then it makes sense to disable watermark boosting as well. While
there are other high-order allocations that potentially benefit, they
are relatively rare.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..565bb9973ff8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -472,6 +472,7 @@ static int __init setup_transparent_hugepage(char *str)
&transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
+ disable_watermark_boosting();
ret = 1;
}
out:
--
2.16.4

2020-02-26 01:32:53

by David Rientjes

Subject: Re: [PATCH 2/3] mm, page_alloc: Disable watermark boosting if THP is disabled at boot

On Tue, 25 Feb 2020, Mel Gorman wrote:

> Watermark boosting is intended to increase the success rate and reduce
> latency of high-order allocations, particularly THP. If THP is disabled
> at boot, then it makes sense to disable watermark boosting as well. While
> there are other high-order allocations that potentially benefit, they
> are relatively rare.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/huge_memory.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..565bb9973ff8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -472,6 +472,7 @@ static int __init setup_transparent_hugepage(char *str)
> &transparent_hugepage_flags);
> clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> &transparent_hugepage_flags);
> + disable_watermark_boosting();
> ret = 1;
> }
> out:

Seems like watermark boosting can help prevent fragmentation so it
benefits all hugepage sized allocations for the long term and that would
include dynamic provisioning of hugetlb memory or hugetlb overcommit?

2020-02-26 02:51:50

by Andrew Morton

Subject: Re: [PATCH 0/3] Limit runaway reclaim due to watermark boosting

On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <[email protected]> wrote:

> Ivan Babrou reported the following

http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
is helpful.

> Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> an external fragmentation event occurs") introduced undesired
> effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before we saw free memory hover at around 1.4GB with no
> spikes. After the upgrade we saw some machines decide that they
> need a lot more than that, with frequent spikes above 10GB,
> often only on a single numa node.
>
> There have been a few reports recently that might be watermark boost
> related. Unfortunately, finding someone that can reproduce the problem
> and test a patch has been problematic. This series intends to limit
> potential damage only.

It's problematic that we don't understand what's happening. And these
palliatives can only reduce our ability to do that.

Rik seems to have the means to reproduce this (or something similar)
and it seems Ivan can test patches three weeks hence. So how about a
debug patch which will help figure out what's going on in there?

2020-02-26 08:06:09

by Mel Gorman

Subject: Re: [PATCH 0/3] Limit runaway reclaim due to watermark boosting

On Tue, Feb 25, 2020 at 06:51:30PM -0800, Andrew Morton wrote:
> On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <[email protected]> wrote:
>
> > Ivan Babrou reported the following
>
> http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
> is helpful.
>

Noted for future reference.

> > Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> > an external fragmentation event occurs") introduced undesired
> > effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no
> > spikes. After the upgrade we saw some machines decide that they
> > need a lot more than that, with frequent spikes above 10GB,
> > often only on a single numa node.
> >
> > There have been a few reports recently that might be watermark boost
> > related. Unfortunately, finding someone that can reproduce the problem
> > and test a patch has been problematic. This series intends to limit
> > potential damage only.
>
> It's problematic that we don't understand what's happening. And these
> palliatives can only reduce our ability to do that.
>

Not for certain, no, but we do know that there are conditions whereby
node 0 can end up reclaiming excessively for extended periods of time.
The available evidence does match a pattern whereby a lower zone on node
0 is getting stuck in a boosted state.

> Rik seems to have the means to reproduce this (or something similar)
> and it seems Ivan can test patches three weeks hence.

If Rik can reproduce it, great, but I have a strong feeling that Ivan
may never be able to test this if it requires a production machine,
which is why I did not wait the three weeks.

> So how about a
> debug patch which will help figure out what's going on in there?

A debug patch would not help much in this case given that we
already have tracepoints. An ftrace capture containing
mm_page_alloc_extfrag, mm_vmscan_kswapd_wake, mm_vmscan_wakeup_kswapd
and mm_vmscan_node_reclaim_begin for 30 seconds while the problem is
occurring would be a big help. Ideally mm_vmscan_lru_shrink_inactive
would also be included to capture the priority, but the size of the
trace is what's going to be problematic.

mm_page_alloc_extfrag would be correlated with the conditions that boost
the watermarks, and the others would track what kswapd is doing to see
if it's persistently reclaiming. If it is, mm_vmscan_lru_shrink_inactive
would tell whether it's persistently reclaiming at priority
DEF_PRIORITY - 2, which would prove the patch would at least mitigate
the problem.

It would be preferable to have a description of a testcase that
reproduces the problem and I'll capture/analyse the trace myself.
It would also be something I could slot into a test grid to catch the
problem happening again in the future.

--
Mel Gorman
SUSE Labs

2020-02-26 08:08:21

by Mel Gorman

Subject: Re: [PATCH 2/3] mm, page_alloc: Disable watermark boosting if THP is disabled at boot

On Tue, Feb 25, 2020 at 05:32:24PM -0800, David Rientjes wrote:
> On Tue, 25 Feb 2020, Mel Gorman wrote:
>
> > Watermark boosting is intended to increase the success rate and reduce
> > latency of high-order allocations, particularly THP. If THP is disabled
> > at boot, then it makes sense to disable watermark boosting as well. While
> > there are other high-order allocations that potentially benefit, they
> > are relatively rare.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/huge_memory.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index b08b199f9a11..565bb9973ff8 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -472,6 +472,7 @@ static int __init setup_transparent_hugepage(char *str)
> > &transparent_hugepage_flags);
> > clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > &transparent_hugepage_flags);
> > + disable_watermark_boosting();
> > ret = 1;
> > }
> > out:
>
> Seems like watermark boosting can help prevent fragmentation so it
> benefits all hugepage sized allocations for the long term and that would
> include dynamic provisioning of hugetlb memory or hugetlb overcommit?

Yes, but it's very rare to hear of hugetlb being dynamically provisioned
or overcommitted now that THP exists and no longer stalls the system
excessively. I'm happy enough to drop this patch because I'm not relying
on it in the context of this bug.

--
Mel Gorman
SUSE Labs