2020-02-07 22:56:53

by Ivan Babrou

Subject: Reclaim regression after 1c30844d2dfe

This change from 5.0 times:

* https://github.com/torvalds/linux/commit/1c30844d2dfe

> mm: reclaim small amounts of memory when an external fragmentation event occurs

Introduced undesired effects in our environment.

* NUMA with 2 x CPU
* 128GB of RAM
* THP disabled
* Upgraded from 4.19 to 5.4

Before we saw free memory hover at around 1.4GB with no spikes. After
the upgrade we saw some machines decide that they need a lot more than
that, with frequent spikes above 10GB, often only on a single numa
node.

We can see kswapd quite active in balance_pgdat (it didn't look like
it slept at all):

$ ps uax | fgrep kswapd
root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]

This in turn massively increased pressure on page cache, which did not
go well for services that depend on having a quick response from a
local cache backed by solid storage.

Here's what it looked like when I zeroed vm.watermark_boost_factor:

* https://imgur.com/a/6IZWicU

IO subsided from 100% busy in page cache population at 300MB/s on a
single SATA drive down to under 100MB/s.

This sort of regression doesn't seem like a good thing.
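
For context on the knob zeroed above: vm.watermark_boost_factor is expressed in units of 1/10,000 of a zone's high watermark, and the documented default of 15000 allows a boost of up to 150% of that watermark. Below is a minimal userspace sketch of how such a cap could be computed. It assumes 4 KiB pages and 512-page pageblocks; the function names and the PAGEBLOCK_NR_PAGES constant are illustrative and this is not the kernel's code.

```c
/* Userspace model only; names and constants are assumptions for illustration. */
#define PAGEBLOCK_NR_PAGES 512UL /* assumed: 2 MiB pageblocks, 4 KiB pages */

/* Upper bound on the boost for a zone, in pages. Returns 0 when boosting is off. */
static unsigned long max_boost_pages(unsigned long high_wmark_pages,
                                     unsigned long boost_factor)
{
    unsigned long max_boost;

    if (boost_factor == 0)
        return 0;

    /* high_wmark * factor / 10000, split to limit intermediate overflow */
    max_boost = high_wmark_pages / 10000UL * boost_factor
              + high_wmark_pages % 10000UL * boost_factor / 10000UL;
    if (max_boost == 0)
        return 0;

    /* Never cap below one pageblock */
    return max_boost < PAGEBLOCK_NR_PAGES ? PAGEBLOCK_NR_PAGES : max_boost;
}

/* One fragmentation event: raise the boost by a pageblock, up to the cap. */
static unsigned long boost_watermark(unsigned long cur_boost,
                                     unsigned long high_wmark_pages,
                                     unsigned long boost_factor)
{
    unsigned long cap = max_boost_pages(high_wmark_pages, boost_factor);
    unsigned long next;

    if (cap == 0)
        return cur_boost; /* boosting disabled: leave the value alone */

    next = cur_boost + PAGEBLOCK_NR_PAGES;
    return next < cap ? next : cap;
}
```

With a factor of 0 the model never raises the boost, which is consistent with the drop in IO reported after zeroing the sysctl.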


2020-02-07 23:06:48

by Rik van Riel

Subject: Re: Reclaim regression after 1c30844d2dfe

On Fri, 2020-02-07 at 14:54 -0800, Ivan Babrou wrote:
> This change from 5.5 times:
>
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>
> > mm: reclaim small amounts of memory when an external fragmentation
> > event occurs
>
> Introduced undesired effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before we saw free memory hover at around 1.4GB with no spikes. After
> the upgrade we saw some machines decide that they need a lot more than
> that, with frequent spikes above 10GB, often only on a single numa
> node.
>
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
>
> $ ps uax | fgrep kswapd
> root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
> root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
>
> This in turn massively increased pressure on page cache, which did not
> go well to services that depend on having a quick response from a
> local cache backed by solid storage.
>
> Here's how it looked like when I zeroed vm.watermark_boost_factor:

We have observed the same thing, even on single node systems.

I have some hacky patches to apply the watermark_boost thing on
a per pgdat basis, which seems to resolve the issue, but I have
not yet found the time to get the locking for that correct.

Given how rare the watermark boosting is, maybe the answer is
just to use atomics? Not sure :)

--
All Rights Reversed.



2020-02-08 09:08:50

by Vlastimil Babka

Subject: Re: Reclaim regression after 1c30844d2dfe

On 2/8/20 12:05 AM, Rik van Riel wrote:
> On Fri, 2020-02-07 at 14:54 -0800, Ivan Babrou wrote:
>> This change from 5.5 times:
>>
>> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>>
>>> mm: reclaim small amounts of memory when an external fragmentation
>>> event occurs
>>
>> Introduced undesired effects in our environment.
>>
>> * NUMA with 2 x CPU
>> * 128GB of RAM
>> * THP disabled
>> * Upgraded from 4.19 to 5.4
>>
>> Before we saw free memory hover at around 1.4GB with no spikes. After
>> the upgrade we saw some machines decide that they need a lot more than
>> that, with frequent spikes above 10GB, often only on a single numa
>> node.
>>
>> We can see kswapd quite active in balance_pgdat (it didn't look like
>> it slept at all):
>>
>> $ ps uax | fgrep kswapd
>> root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
>> root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
>>
>> This in turn massively increased pressure on page cache, which did not
>> go well to services that depend on having a quick response from a
>> local cache backed by solid storage.
>>
>> Here's how it looked like when I zeroed vm.watermark_boost_factor:
>
> We have observed the same thing, even on single node systems.
>
> I have some hacky patches to apply the watermark_boost thing on
> a per pgdat basis, which seems to resolve the issue, but I have
> not yet found the time to get the locking for that correct.

I wonder why a per-pgdat basis would help in general (might help some
corner cases?). Because I guess fundamentally the issue is the part
"reclaim an amount of memory relative to the size of the high watermark
and the watermark_boost_factor until the boost is cleared".
That means no matter how much memory there is already free, it will keep
reclaiming until nr_boost_reclaim reaches zero. This danger of runaway
reclaim wouldn't be there if it only reclaimed up to the boosted
watermark (or some watermark derived from that).

But yeah it's also weird that if you have so much free memory, you keep
getting the external fragmentation events that wake up kswapd for
boosting in the first place. Worth investigating too.
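
The difference between the two stopping rules described above can be sketched with a toy model. Everything here is an illustrative assumption (the struct, the function names, the fixed batch size); it is not kernel code and ignores cases where reclaim makes no progress.

```c
/* Toy model of one node; all names and numbers are hypothetical. */
struct toy_node {
    unsigned long free_pages;
    unsigned long high_wmark;
    unsigned long boost; /* outstanding boost, in pages */
};

/*
 * Rule 1: keep reclaiming until the boost budget is used up,
 * regardless of how much memory is already free.
 * Returns the number of pages reclaimed.
 */
static unsigned long reclaim_until_boost_cleared(struct toy_node *n,
                                                 unsigned long batch)
{
    unsigned long nr_boost_reclaim = n->boost;
    unsigned long reclaimed = 0;

    if (batch == 0)
        return 0;

    while (nr_boost_reclaim > 0) {
        unsigned long step = batch < nr_boost_reclaim ? batch : nr_boost_reclaim;

        n->free_pages += step;
        reclaimed += step;
        nr_boost_reclaim -= step;
    }
    n->boost = 0;
    return reclaimed;
}

/*
 * Rule 2: reclaim only until free memory reaches the boosted
 * watermark (high watermark plus boost), then stop.
 */
static unsigned long reclaim_up_to_boosted_wmark(struct toy_node *n,
                                                 unsigned long batch)
{
    unsigned long target = n->high_wmark + n->boost;
    unsigned long reclaimed = 0;

    if (batch == 0)
        return 0;

    while (n->free_pages < target) {
        unsigned long need = target - n->free_pages;
        unsigned long step = batch < need ? batch : need;

        n->free_pages += step;
        reclaimed += step;
    }
    n->boost = 0;
    return reclaimed;
}
```

For a node that already has far more free pages than the boosted watermark, rule 1 still reclaims the full boost while rule 2 reclaims nothing, which is the runaway behaviour in question.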

> Given how rare the watermark boosting is, maybe the answer is
> just to use atomics? Not sure :)
>

2020-02-11 10:17:56

by Mel Gorman

Subject: Re: Reclaim regression after 1c30844d2dfe

On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> This change from 5.5 times:
>
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>
> > mm: reclaim small amounts of memory when an external fragmentation event occurs
>
> Introduced undesired effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before we saw free memory hover at around 1.4GB with no spikes. After
> the upgrade we saw some machines decide that they need a lot more than
> that, with frequent spikes above 10GB, often only on a single numa
> node.
>
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
>
> $ ps uax | fgrep kswapd
> root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
> root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
>
> This in turn massively increased pressure on page cache, which did not
> go well to services that depend on having a quick response from a
> local cache backed by solid storage.
>
> Here's how it looked like when I zeroed vm.watermark_boost_factor:
>
> * https://imgur.com/a/6IZWicU
>
> IO subsided from 100% busy in page cache population at 300MB/s on a
> single SATA drive down to under 100MB/s.
>
> This sort of regression doesn't seem like a good thing.

It is not a good thing, so thanks for the report. Obviously I have not
seen something similar or at least not severe enough to show up on my radar.
I'd seen some increases with reclaim activity affecting benchmarks that
rely on use-twice data remaining resident but nothing severe enough to
warrant action.

Can you tell me if it is *always* node 0 that shows crazy activity? I
ask because some conditions would have to be met for the boost to always
apply. It's already a per-zone attribute but it is treated indirectly as a
pgdat property. What I'm thinking is that on node 0, the DMA32 or DMA zone
gets boosted but vmscan then reclaims from higher zones until the boost is
removed. That would excessively reclaim memory but be specific to node 0.

I've cc'd Rik as he says he saw something similar even on single node
systems. The boost applying to lower zones would still affect single
node systems but NUMA machines always getting impacted by boost would
show that the boost really needs to be a per-node flag. Sure, we *could*
apply the reclaim to just the lower zones but that potentially means a
*lot* of scan activity -- potentially 124G of pages before a lower zone
page is found on Ivan's machine. That might be the very situation being
encountered here.
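
To put the 124G figure in terms of pages, here is a small arithmetic sketch. It assumes the lower zones (DMA plus DMA32) cover the first 4 GiB, as on x86-64, and a 4 KiB page size; the helper name is made up for illustration.

```c
#include <stdint.h>

/*
 * Number of pages sitting above the lower zones, given total memory
 * and the size of the lower zones in GiB. Returns 0 on invalid input.
 */
static uint64_t pages_above_low_zones(uint64_t total_gib,
                                      uint64_t low_zone_gib,
                                      uint64_t page_size)
{
    if (page_size == 0 || low_zone_gib > total_gib)
        return 0;

    return (total_gib - low_zone_gib) * (UINT64_C(1) << 30) / page_size;
}
```

Under those assumptions, pages_above_low_zones(128, 4, 4096) comes to 32,505,856 pages, roughly 32.5 million pages that a scan restricted to lower zones might have to skip.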

An alternative is that boosting is only ever applied to the highest
populated zone in a system. The intent of the patch was primarily about
THP which can use any zone to reduce their allocation latency. While
it's possible that there are cases where the latency of other orders
matters *and* they require lower zones, I think it's unlikely and that
this would be a safer option overall.

However, overall I think the simplest is to abort the boosting if
reclaim is reaching higher priorities without being able to clear
the boost. The boost is best-effort to reduce allocation latency in
the future. This approach still has some overhead as there is a reclaim
pass but kswapd will abort and go to sleep if the normal watermarks
are met.

This is build tested only. Ideally someone on the cc has a test case
that can reproduce this specific problem of excessive kswapd activity.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..71dd47172cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	return false;
 }
 
+static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
+						unsigned long *zone_boosts)
+{
+	struct zone *zone;
+	unsigned long flags;
+	int i;
+
+	for (i = 0; i <= classzone_idx; i++) {
+		if (!zone_boosts[i])
+			continue;
+
+		/* Increments are under the zone lock */
+		zone = pgdat->node_zones + i;
+		spin_lock_irqsave(&zone->lock, flags);
+		zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
@@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!nr_boost_reclaim && balanced)
 			goto out;
 
-		/* Limit the priority of boosting to avoid reclaim writeback */
-		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
-			raise_priority = false;
+		/*
+		 * Abort boosting if reclaiming at higher priority is not
+		 * working to avoid excessive reclaim due to lower zones
+		 * being boosted.
+		 */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
+			acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
+			boosted = false;
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
 
 		/*
 		 * Do not writeback or swap pages for boosted reclaim. The
@@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 out:
 	/* If reclaim was boosted, account for the reclaim done in this pass */
 	if (boosted) {
-		unsigned long flags;
-
-		for (i = 0; i <= classzone_idx; i++) {
-			if (!zone_boosts[i])
-				continue;
-
-			/* Increments are under the zone lock */
-			zone = pgdat->node_zones + i;
-			spin_lock_irqsave(&zone->lock, flags);
-			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
-			spin_unlock_irqrestore(&zone->lock, flags);
-		}
+		acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
 
 		/*
 		 * As there is now likely space, wakeup kcompact to defragment
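
The subtraction inside acct_boosted_reclaim() above is clamped with min() so that an unsigned counter cannot wrap below zero. A standalone sketch of that pattern follows; clear_boost and its parameter names are hypothetical, and the locking done in the patch is left out.

```c
/*
 * Subtract `accounted` from `watermark_boost` without letting the
 * unsigned result wrap: at most the current value is removed.
 */
static unsigned long clear_boost(unsigned long watermark_boost,
                                 unsigned long accounted)
{
    unsigned long dec = accounted < watermark_boost ? accounted : watermark_boost;

    return watermark_boost - dec;
}
```

A plain `watermark_boost - accounted` would wrap to a very large value whenever `accounted` exceeds the current boost, which would leave the zone looking heavily boosted instead of cleared.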

2020-02-12 22:46:14

by Ivan Babrou

Subject: Re: Reclaim regression after 1c30844d2dfe

Here's a typical graph: https://imgur.com/a/n03x5yH

* Green (numa0) and blue (numa1) for 4.19
* Yellow (numa0) and orange (numa1) for 5.4

These downward slopes on numa0 on 5.4 are fairly typical of the
worst case scenario.

If I try to clean up data a bit from a bunch of machines, this is how
numa0 compares to numa1 with 1h average values of free memory above
5GiB:

* https://imgur.com/a/6T4rRzi

I think it's safe to say that numa0 is much, much worse, but I cannot
be 100% sure that numa1 is free from adverse effects; they may just be
hiding in the noise caused by rolling reboots.


On Tue, Feb 11, 2020 at 2:16 AM Mel Gorman wrote:
>
> On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> > This change from 5.5 times:
> >
> > * https://github.com/torvalds/linux/commit/1c30844d2dfe
> >
> > > mm: reclaim small amounts of memory when an external fragmentation event occurs
> >
> > Introduced undesired effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no spikes. After
> > the upgrade we saw some machines decide that they need a lot more than
> > that, with frequent spikes above 10GB, often only on a single numa
> > node.
> >
> > We can see kswapd quite active in balance_pgdat (it didn't look like
> > it slept at all):
> >
> > $ ps uax | fgrep kswapd
> > root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
> > root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
> >
> > This in turn massively increased pressure on page cache, which did not
> > go well to services that depend on having a quick response from a
> > local cache backed by solid storage.
> >
> > Here's how it looked like when I zeroed vm.watermark_boost_factor:
> >
> > * https://imgur.com/a/6IZWicU
> >
> > IO subsided from 100% busy in page cache population at 300MB/s on a
> > single SATA drive down to under 100MB/s.
> >
> > This sort of regression doesn't seem like a good thing.
>
> It is not a good thing, so thanks for the report. Obviously I have not
> seen something similar or at least not severe enough to show up on my radar.
> I'd seen some increases with reclaim activity affecting benchmarks that
> rely on use-twice data remaining resident but nothing severe enough to
> warrant action.
>
> Can you tell me if it is *always* node 0 that shows crazy activity? I
> ask because some conditions would have to be met for the boost to always
> apply. It's already a per-zone attribute but it is treated indirectly as a
> pgdat property. What I'm thinking is that on node 0, the DMA32 or DMA zone
> gets boosted but vmscan then reclaims from higher zones until the boost is
> removed. That would excessively reclaim memory but be specific to node 0.
>
> I've cc'd Rik as he says he saw something similar even on single node
> systems. The boost applying to lower zones would still affect single
> node systems but NUMA machines always getting impacted by boost would
> show that the boost really needs to be a per-node flag. Sure, we *could*
> apply the reclaim to just the lower zones but that potentially means a
> *lot* of scan activity -- potentially 124G of pages before a lower zone
> page is found on Ivan's machine. That might be the very situation being
> encountered here.
>
> An alternative is that boosting is only ever applied to the highest
> populated zone in a system. The intent of the patch was primarily about
> THP which can use any zone to reduce their allocation latency. While
> it's possible that there are cases where the latency of other orders
> matter *and* they require lower zones, I think it's unlikely and that
> this would be a safer option overall.
>
> However, overall I think the simplest is to abort the boosting if
> reclaim is reaching higher priorities without being able to clear
> the boost. The boost is best-effort to reduce allocation latency in
> the future. This approach still has some overhead as there is a reclaim
> pass but kswapd will abort and go to sleep if the normal watermarks
> are met.
>
> This is build tested only. Ideally someone on the cc has a test case
> that can reproduce this specific problem of excessive kswapd activity.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 572fb17c6273..71dd47172cef 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> return false;
> }
>
> +static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
> + unsigned long *zone_boosts)
> +{
> + struct zone *zone;
> + unsigned long flags;
> + int i;
> +
> + for (i = 0; i <= classzone_idx; i++) {
> + if (!zone_boosts[i])
> + continue;
> +
> + /* Increments are under the zone lock */
> + zone = pgdat->node_zones + i;
> + spin_lock_irqsave(&zone->lock, flags);
> + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> + spin_unlock_irqrestore(&zone->lock, flags);
> + }
> +}
> +
> /* Clear pgdat state for congested, dirty or under writeback. */
> static void clear_pgdat_congested(pg_data_t *pgdat)
> {
> @@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> if (!nr_boost_reclaim && balanced)
> goto out;
>
> - /* Limit the priority of boosting to avoid reclaim writeback */
> - if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> - raise_priority = false;
> + /*
> + * Abort boosting if reclaiming at higher priority is not
> + * working to avoid excessive reclaim due to lower zones
> + * being boosted.
> + */
> + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
> + acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
> + boosted = false;
> + nr_boost_reclaim = 0;
> + goto restart;
> + }
>
> /*
> * Do not writeback or swap pages for boosted reclaim. The
> @@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> out:
> /* If reclaim was boosted, account for the reclaim done in this pass */
> if (boosted) {
> - unsigned long flags;
> -
> - for (i = 0; i <= classzone_idx; i++) {
> - if (!zone_boosts[i])
> - continue;
> -
> - /* Increments are under the zone lock */
> - zone = pgdat->node_zones + i;
> - spin_lock_irqsave(&zone->lock, flags);
> - zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> - spin_unlock_irqrestore(&zone->lock, flags);
> - }
> + acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
>
> /*
> * As there is now likely space, wakeup kcompact to defragment

2020-02-12 23:56:22

by Mel Gorman

Subject: Re: Reclaim regression after 1c30844d2dfe

On Wed, Feb 12, 2020 at 02:45:39PM -0800, Ivan Babrou wrote:
> Here's a typical graph: https://imgur.com/a/n03x5yH
>
> * Green (numa0) and blue (numa1) for 4.19
> * Yellow (numa0) and orange (numa1) for 5.4
>
> These downward slopes on numa0 on 5.4 are somewhat typical to the
> worst case scenario.
>
> If I try to clean up data a bit from a bunch of machines, this is how
> numa0 compares to numa1 with 1h average values of free memory above
> 5GiB:
>
> * https://imgur.com/a/6T4rRzi
>
> I think it's safe to say that numa0 is much much worse, but I cannot
> be 100% sure that numa1 is free from adverse effects, they may be just
> hiding in the noise caused by rolling reboots.
>

Ok, while I expected node 0 to be worse in general, a runaway boost due
to constant fragmentation would be a problem in general. In either case,
the patch should reduce the damage. Is there any chance that the patch
can be tested or would it be disruptive for you?

--
Mel Gorman
SUSE Labs

2020-02-18 22:09:50

by Ivan Babrou

Subject: Re: Reclaim regression after 1c30844d2dfe

I won't have time to try the patch for the next three weeks or so, sorry.

On Wed, Feb 12, 2020 at 3:55 PM Mel Gorman wrote:
>
> On Wed, Feb 12, 2020 at 02:45:39PM -0800, Ivan Babrou wrote:
> > Here's a typical graph: https://imgur.com/a/n03x5yH
> >
> > * Green (numa0) and blue (numa1) for 4.19
> > * Yellow (numa0) and orange (numa1) for 5.4
> >
> > These downward slopes on numa0 on 5.4 are somewhat typical to the
> > worst case scenario.
> >
> > If I try to clean up data a bit from a bunch of machines, this is how
> > numa0 compares to numa1 with 1h average values of free memory above
> > 5GiB:
> >
> > * https://imgur.com/a/6T4rRzi
> >
> > I think it's safe to say that numa0 is much much worse, but I cannot
> > be 100% sure that numa1 is free from adverse effects, they may be just
> > hiding in the noise caused by rolling reboots.
> >
>
> Ok, while I expected node 0 to be worse in general, a runaway boost due
> to constant fragmentation would be a problem in general. In either case,
> the patch should reduce the damage. Is there any chance that the patch
> can be tested or would it be disruptive for you?
>
> --
> Mel Gorman
> SUSE Labs