2020-02-19 18:27:07

by Sultan Alsawaf

Subject: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

From: Sultan Alsawaf <[email protected]>

Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.

Signed-off-by: Sultan Alsawaf <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 17 ++++++++++++++---
mm/vmscan.c | 3 ++-
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..49c922abfb90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
+#include <linux/refcount.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -735,6 +736,7 @@ typedef struct pglist_data {
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
+ refcount_t kswapd_waiters;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..2d4caacfd2fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int no_progress_loops;
unsigned int cpuset_mems_cookie;
int reserve_flags;
+ pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
+ bool woke_kswapd = false;

/*
* We also sanity check to catch abuse of atomic reserves being used by
@@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (!ac->preferred_zoneref->zone)
goto nopage;

- if (alloc_flags & ALLOC_KSWAPD)
+ if (alloc_flags & ALLOC_KSWAPD) {
+ if (!woke_kswapd) {
+ refcount_inc(&pgdat->kswapd_waiters);
+ woke_kswapd = true;
+ }
wake_all_kswapds(order, gfp_mask, ac);
+ }

/*
* The adjusted alloc_flags might result in immediate success, so try
@@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
}
fail:
- warn_alloc(gfp_mask, ac->nodemask,
- "page allocation failure: order:%u", order);
got_pg:
+ if (woke_kswapd)
+ refcount_dec(&pgdat->kswapd_waiters);
+ if (!page)
+ warn_alloc(gfp_mask, ac->nodemask,
+ "page allocation failure: order:%u", order);
return page;
}

@@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
lruvec_init(&pgdat->__lruvec);
+ pgdat->kswapd_waiters = (refcount_t)REFCOUNT_INIT(0);
}

static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..e795add372d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
__fs_reclaim_release();
ret = try_to_freeze();
__fs_reclaim_acquire();
- if (ret || kthread_should_stop())
+ if (ret || kthread_should_stop() ||
+ !refcount_read(&pgdat->kswapd_waiters))
break;

/*
--
2.25.1


2020-02-19 19:14:50

by Dave Hansen

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

But kswapd isn't just to provide memory to waiters. It also serves to
get free memory back up to the high watermark. This seems like it might
result in more frequent allocation stalls and kswapd wakeups, which
consume extra resources.

I guess I'd wonder what positive effects you have observed as a result
of this patch and whether you've gone looking for any negative effects.

2020-02-19 19:27:29

by Andrew Morton

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, 19 Feb 2020 10:25:22 -0800 Sultan Alsawaf <[email protected]> wrote:

> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

Seems sensible.

Please fully describe the userspace-visible runtime effects of this
change?

2020-02-19 19:36:56

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

[Cc Mel and Johannes]

On Wed 19-02-20 10:25:22, Sultan Alsawaf wrote:
> From: Sultan Alsawaf <[email protected]>
>
> Keeping kswapd running when all the failed allocations that invoked it
> are satisfied incurs a high overhead due to unnecessary page eviction
> and writeback, as well as spurious VM pressure events to various
> registered shrinkers. When kswapd doesn't need to work to make an
> allocation succeed anymore, stop it prematurely to save resources.

I do not think this patch is correct. kswapd is supposed to balance a
node and get it up to the high watermark. The number of contexts which
woke it up is not really relevant. If for no other reason than that each
allocation request might be of a different size.

Could you be more specific about the problem you are trying to address
please?

> Signed-off-by: Sultan Alsawaf <[email protected]>
> ---
> include/linux/mmzone.h | 2 ++
> mm/page_alloc.c | 17 ++++++++++++++---
> mm/vmscan.c | 3 ++-
> 3 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..49c922abfb90 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -20,6 +20,7 @@
> #include <linux/atomic.h>
> #include <linux/mm_types.h>
> #include <linux/page-flags.h>
> +#include <linux/refcount.h>
> #include <asm/page.h>
>
> /* Free memory management - zoned buddy allocator. */
> @@ -735,6 +736,7 @@ typedef struct pglist_data {
> unsigned long node_spanned_pages; /* total size of physical page
> range, including holes */
> int node_id;
> + refcount_t kswapd_waiters;
> wait_queue_head_t kswapd_wait;
> wait_queue_head_t pfmemalloc_wait;
> struct task_struct *kswapd; /* Protected by
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..2d4caacfd2fc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int no_progress_loops;
> unsigned int cpuset_mems_cookie;
> int reserve_flags;
> + pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
> + bool woke_kswapd = false;
>
> /*
> * We also sanity check to catch abuse of atomic reserves being used by
> @@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (!ac->preferred_zoneref->zone)
> goto nopage;
>
> - if (alloc_flags & ALLOC_KSWAPD)
> + if (alloc_flags & ALLOC_KSWAPD) {
> + if (!woke_kswapd) {
> + refcount_inc(&pgdat->kswapd_waiters);
> + woke_kswapd = true;
> + }
> wake_all_kswapds(order, gfp_mask, ac);
> + }
>
> /*
> * The adjusted alloc_flags might result in immediate success, so try
> @@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto retry;
> }
> fail:
> - warn_alloc(gfp_mask, ac->nodemask,
> - "page allocation failure: order:%u", order);
> got_pg:
> + if (woke_kswapd)
> + refcount_dec(&pgdat->kswapd_waiters);
> + if (!page)
> + warn_alloc(gfp_mask, ac->nodemask,
> + "page allocation failure: order:%u", order);
> return page;
> }
>
> @@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
> pgdat_page_ext_init(pgdat);
> spin_lock_init(&pgdat->lru_lock);
> lruvec_init(&pgdat->__lruvec);
> + pgdat->kswapd_waiters = (refcount_t)REFCOUNT_INIT(0);
> }
>
> static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05eb9efec07..e795add372d1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> __fs_reclaim_release();
> ret = try_to_freeze();
> __fs_reclaim_acquire();
> - if (ret || kthread_should_stop())
> + if (ret || kthread_should_stop() ||
> + !refcount_read(&pgdat->kswapd_waiters))
> break;
>
> /*
> --
> 2.25.1
>

--
Michal Hocko
SUSE Labs

2020-02-19 19:40:32

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 11:13:21AM -0800, Dave Hansen wrote:
> On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> > Keeping kswapd running when all the failed allocations that invoked it
> > are satisfied incurs a high overhead due to unnecessary page eviction
> > and writeback, as well as spurious VM pressure events to various
> > registered shrinkers. When kswapd doesn't need to work to make an
> > allocation succeed anymore, stop it prematurely to save resources.
>
> But kswapd isn't just to provide memory to waiters. It also serves to
> get free memory back up to the high watermark. This seems like it might
> result in more frequent allocation stalls and kswapd wakeups, which
> consume extra resources.
>
> I guess I'd wonder what positive effects you have observed as a result
> of this patch and whether you've gone looking for any negative effects.

This patch essentially stops kswapd from going overboard when a failed
allocation fires up kswapd. Otherwise, when memory pressure is really high,
kswapd just chomps through CPU time freeing pages nonstop when it isn't needed.
On a constrained system I tested (mem=2G), this patch had the positive effect of
improving overall responsiveness at high memory pressure.

On systems with more memory I tested (>=4G), kswapd becomes more expensive to
run at its higher scan depths, so stopping kswapd prematurely when there aren't
any memory allocations waiting for it prevents it from reaching the *really*
expensive scan depths and burning through even more resources.

Combine a large amount of memory with a slow CPU and the current problematic
behavior of kswapd at high memory pressure shows. My personal test scenario for
this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).

Sultan

2020-02-19 20:07:12

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

[Oops, for some reason I missed Dave's response previously]

On Wed 19-02-20 11:40:06, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 11:13:21AM -0800, Dave Hansen wrote:
> > On 2/19/20 10:25 AM, Sultan Alsawaf wrote:
> > > Keeping kswapd running when all the failed allocations that invoked it
> > > are satisfied incurs a high overhead due to unnecessary page eviction
> > > and writeback, as well as spurious VM pressure events to various
> > > registered shrinkers. When kswapd doesn't need to work to make an
> > > allocation succeed anymore, stop it prematurely to save resources.
> >
> > But kswapd isn't just to provide memory to waiters. It also serves to
> > get free memory back up to the high watermark. This seems like it might
> > result in more frequent allocation stalls and kswapd wakeups, which
> > consume extra resources.

Agreed, as expressed in my other reply.

> > I guess I'd wonder what positive effects you have observed as a result
> > of this patch and whether you've gone looking for any negative effects.
>
> This patch essentially stops kswapd from going overboard when a failed
> allocation fires up kswapd. Otherwise, when memory pressure is really high,
> kswapd just chomps through CPU time freeing pages nonstop when it isn't needed.

Could you be more specific, please? kswapd should stop as soon as the
high watermark is reached. If that is not the case then there is a bug
which should be fixed.

Sure, it is quite possible that kswapd is busy for an extended amount of
time if the memory pressure is continuous.

> On a constrained system I tested (mem=2G), this patch had the positive effect of
> improving overall responsiveness at high memory pressure.

Again, do you have more details about the workload and what was the
cause of responsiveness issues? Because I would expect that the
situation would be quite the opposite because it is usually the direct
reclaim that is a source of stalls visible from userspace. Or is this
about a single CPU situation where kswapd saturates the single CPU and
all other tasks are just not getting enough CPU cycles?

> On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> run at its higher scan depths, so stopping kswapd prematurely when there aren't
> any memory allocations waiting for it prevents it from reaching the *really*
> expensive scan depths and burning through even more resources.
>
> Combine a large amount of memory with a slow CPU and the current problematic
> behavior of kswapd at high memory pressure shows. My personal test scenario for
> this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).

But still, somebody has to put the system into a balanced state, so who is
going to do all the work?
--
Michal Hocko
SUSE Labs

2020-02-19 20:42:46

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 09:05:27PM +0100, Michal Hocko wrote:
> Could you be more specific, please? kswapd should stop as soon as the
> high watermark is reached. If that is not the case then there is a bug
> which should be fixed.

No, there is no bug causing kswapd to continue beyond the high watermark.

> Sure, it is quite possible that kswapd is busy for an extended amount of
> time if the memory pressure is continuous.
>
> > On a constrained system I tested (mem=2G), this patch had the positive effect of
> > improving overall responsiveness at high memory pressure.
>
> Again, do you have more details about the workload and what was the
> cause of responsiveness issues? Because I would expect that the
> situation would be quite the opposite because it is usually the direct
> reclaim that is a source of stalls visible from userspace. Or is this
> about a single CPU situation where kswapd saturates the single CPU and
> all other tasks are just not getting enough CPU cycles?

The workload was having lots of applications open at once. At a certain point
when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.
I added printks into kswapd with this patch, and my premature exit in kswapd
kicked in quite often.

> > On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> > run at its higher scan depths, so stopping kswapd prematurely when there aren't
> > any memory allocations waiting for it prevents it from reaching the *really*
> > expensive scan depths and burning through even more resources.
> >
> > Combine a large amount of memory with a slow CPU and the current problematic
> > behavior of kswapd at high memory pressure shows. My personal test scenario for
> > this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).
>
> But still, somebody has to put the system into a balanced state, so who is
> going to do all the work?

All the work will be done by kswapd of course, but only if it's needed.

The real problem is that a single memory allocation failure, and free memory
being some amount below the high watermark, are not good heuristics to predict
*future* memory allocation needs. They are good for determining how to steer
kswapd to help satisfy a failed allocation in the present, but anything more is
pure speculation (which turns out to be wrong speculation, since this behavior
causes problems).

If there are outstanding failed allocations that won't go away, then it's
perfectly reasonable to keep kswapd running until it frees pages up to the high
watermark. But beyond that is unnecessary, since there's no way to know if or
when kswapd will need to fire up again. This makes sense considering how kswapd
is currently invoked: it's fired up due to a failed allocation of some sort, not
because the amount of free memory dropped below the high watermark.

Sultan

2020-02-19 21:45:42

by Mel Gorman

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 12:42:20PM -0800, Sultan Alsawaf wrote:
> > Again, do you have more details about the workload and what was the
> > cause of responsiveness issues? Because I would expect that the
> > situation would be quite opposite because it is usually the direct
> > reclaim that is a source of stalls visible from userspace. Or is this
> > about a single CPU situation where kswapd saturates the single CPU and
> > all other tasks are just not getting enough CPU cycles?
>
> The workload was having lots of applications open at once. At a certain point
> when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.
> I added printks into kswapd with this patch, and my premature exit in kswapd
> kicked in quite often.
>

This could be watermark boosting run wild again. Can you test with
sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
both to compare and contrast).

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..71dd47172cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
return false;
}

+static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
+ unsigned long *zone_boosts)
+{
+ struct zone *zone;
+ unsigned long flags;
+ int i;
+
+ for (i = 0; i <= classzone_idx; i++) {
+ if (!zone_boosts[i])
+ continue;
+
+ /* Increments are under the zone lock */
+ zone = pgdat->node_zones + i;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
/* Clear pgdat state for congested, dirty or under writeback. */
static void clear_pgdat_congested(pg_data_t *pgdat)
{
@@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
if (!nr_boost_reclaim && balanced)
goto out;

- /* Limit the priority of boosting to avoid reclaim writeback */
- if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
- raise_priority = false;
+ /*
+ * Abort boosting if reclaiming at higher priority is not
+ * working to avoid excessive reclaim due to lower zones
+ * being boosted.
+ */
+ if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
+ acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
+ boosted = false;
+ nr_boost_reclaim = 0;
+ goto restart;
+ }

/*
* Do not writeback or swap pages for boosted reclaim. The
@@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
out:
/* If reclaim was boosted, account for the reclaim done in this pass */
if (boosted) {
- unsigned long flags;
-
- for (i = 0; i <= classzone_idx; i++) {
- if (!zone_boosts[i])
- continue;
-
- /* Increments are under the zone lock */
- zone = pgdat->node_zones + i;
- spin_lock_irqsave(&zone->lock, flags);
- zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
- spin_unlock_irqrestore(&zone->lock, flags);
- }
+ acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);

/*
* As there is now likely space, wakeup kcompact to defragment

2020-02-19 22:44:00

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 09:45:13PM +0000, Mel Gorman wrote:
> This could be watermark boosting run wild again. Can you test with
> sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
> both to compare and contrast).

I can test that, but something to note is that I've been doing equal testing
with this on 4.4, which exhibits the same behavior, and that kernel doesn't have
watermark boosting in it, as far as I can tell.

I don't think what we're addressing here is a "bug", but rather something
fundamental about how we've been thinking about kswapd lifetime. The argument
here is that it's not coherent to be letting kswapd run as it does, and instead
gating it on outstanding allocation requests provides much more reasonable
behavior, given real workloads and use patterns.

Does that make sense and seem reasonable?

Sultan

2020-02-19 22:46:30

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 11:26:53AM -0800, Andrew Morton wrote:
> On Wed, 19 Feb 2020 10:25:22 -0800 Sultan Alsawaf <[email protected]> wrote:
>
> > Keeping kswapd running when all the failed allocations that invoked it
> > are satisfied incurs a high overhead due to unnecessary page eviction
> > and writeback, as well as spurious VM pressure events to various
> > registered shrinkers. When kswapd doesn't need to work to make an
> > allocation succeed anymore, stop it prematurely to save resources.
>
> Seems sensible.
>
> Please fully describe the userspace-visible runtime effects of this
> change?
>

FWIW, it looks like the refcount API doesn't allow refcounts to be zero, so I'd
have to update this patch to use just plain atomic_t instead. But since there
doesn't seem to be much interest in this, I'll only send an updated patch if
it's desired.
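
To illustrate the mismatch, here is a minimal userspace sketch (not
kernel code): refcount_t treats 0 as "object is gone", so an increment
from 0, or a decrement that hits 0 outside refcount_dec_and_test(),
trips a WARN, while a waiter count legitimately bounces between 0 and N:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int kswapd_waiters;	/* 0 is a perfectly valid state here */

int main(void)
{
	/* allocator enters the slow path: 0 -> 1 (refcount_inc() would WARN) */
	atomic_fetch_add(&kswapd_waiters, 1);
	printf("waiters: %d\n", atomic_load(&kswapd_waiters));

	/* allocation satisfied: 1 -> 0 (refcount_dec() would WARN) */
	atomic_fetch_sub(&kswapd_waiters, 1);
	printf("waiters: %d\n", atomic_load(&kswapd_waiters));
	return 0;
}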

Sultan

2020-02-20 08:29:46

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed 19-02-20 12:42:20, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 09:05:27PM +0100, Michal Hocko wrote:
[...]
> > Again, do you have more details about the workload and what was the
> > cause of responsiveness issues? Because I would expect that the
> > situation would be quite the opposite because it is usually the direct
> > reclaim that is a source of stalls visible from userspace. Or is this
> > about a single CPU situation where kswapd saturates the single CPU and
> > all other tasks are just not getting enough CPU cycles?
>
> The workload was having lots of applications open at once. At a certain point
> when memory ran low, my system became sluggish and kswapd CPU usage skyrocketed.

Could you provide more details, please? Is kswapd making forward
progress? Have you checked why other processes are sluggish? Do they
not get CPU time, or are they blocked on something?

> I added printks into kswapd with this patch, and my premature exit in kswapd
> kicked in quite often.
>
> > > On systems with more memory I tested (>=4G), kswapd becomes more expensive to
> > > run at its higher scan depths, so stopping kswapd prematurely when there aren't
> > > any memory allocations waiting for it prevents it from reaching the *really*
> > > expensive scan depths and burning through even more resources.
> > >
> > > Combine a large amount of memory with a slow CPU and the current problematic
> > > behavior of kswapd at high memory pressure shows. My personal test scenario for
> > > this was an arm64 CPU with a variable amount of memory (up to 4G RAM + 2G swap).
> >
> > But still, somebody has to put the system into a balanced state, so who is
> > going to do all the work?
>
> All the work will be done by kswapd of course, but only if it's needed.
>
> The real problem is that a single memory allocation failure, and free memory
> being some amount below the high watermark, are not good heuristics to predict
> *future* memory allocation needs. They are good for determining how to steer
> kswapd to help satisfy a failed allocation in the present, but anything more is
> pure speculation (which turns out to be wrong speculation, since this behavior
> causes problems).

Well, you might be right that there might be better heuristics than the
existing watermark based one. After all nobody can predict the future.
The existing heuristic aims at providing min_free_kbytes of free memory
as much as possible and that tends to work reasonably well for a large
set of workloads.

> If there are outstanding failed allocations that won't go away, then it's
> perfectly reasonable to keep kswapd running until it frees pages up to the high
> watermark. But beyond that is unnecessary, since there's no way to know if or
> when kswapd will need to fire up again. This makes sense considering how kswapd
> is currently invoked: it's fired up due to a failed allocation of some sort, not
> because the amount of free memory dropped below the high watermark.

Very broadly speaking (sorry if I am stating the obvious here), kswapd
is woken up when the allocator hits the low watermark or the requested
high-order pages are depleted; the allocator then enters its slow path.
That means the background reclaim aims at reclaiming the high-low
watermark gap, or invokes compaction to keep the balance. The whole gap
has to be consumed again before kswapd is re-woken for order-0 (the
most common) requests. So this is usually not about a single allocation
triggering the background reclaim, and counting failures on low
watermark attempts is unlikely to work with the current code as you
suggested.
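
To make the hysteresis concrete, here is a self-contained toy model
(the watermark values and reclaim rate are arbitrary illustrative
numbers; this is not kernel code):

#include <stdio.h>

enum { WMARK_MIN = 1000, WMARK_LOW = 1250, WMARK_HIGH = 1500 };

int main(void)
{
	long free = WMARK_HIGH;
	int kswapd_awake = 0, wakeups = 0;

	for (int alloc = 0; alloc < 10000; alloc++) {
		free--;					/* one order-0 allocation */
		if (!kswapd_awake && free < WMARK_LOW) {
			kswapd_awake = 1;		/* slow path wakes kswapd */
			wakeups++;
		}
		if (kswapd_awake && (free += 4) >= WMARK_HIGH)
			kswapd_awake = 0;		/* target met, back to sleep */
	}
	printf("%d allocations caused %d kswapd wakeups\n", 10000, wakeups);
	return 0;
}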
--
Michal Hocko
SUSE Labs

2020-02-20 10:20:10

by Mel Gorman

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 19, 2020 at 02:42:31PM -0800, Sultan Alsawaf wrote:
> On Wed, Feb 19, 2020 at 09:45:13PM +0000, Mel Gorman wrote:
> > This could be watermark boosting run wild again. Can you test with
> > sysctl vm.watermark_boost_factor=0 or the following patch? (preferably
> > both to compare and contrast).
>
> I can test that, but something to note is that I've been doing equal testing
> with this on 4.4, which exhibits the same behavior, and that kernel doesn't have
> watermark boosting in it, as far as I can tell.
>
> I don't think what we're addressing here is a "bug", but rather something
> fundamental about how we've been thinking about kswapd lifetime. The argument
> here is that it's not coherent to be letting kswapd run as it does, and instead
> gating it on outstanding allocation requests provides much more reasonable
> behavior, given real workloads and use patterns.
>
> Does that make sense and seem reasonable?
>

I'm not entirely convinced. The reason the high watermark exists is to have
kswapd work long enough to make progress without a process having to direct
reclaim. The most straight-forward example would be a streaming reader of
a large file. It'll keep pushing the zone towards the low watermark and
kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
the min watermark is hit and stalls occur. While kswapd could stop at the
min watermark, it leaves a very short window for kswapd to make enough
progress before the min watermark is hit.

At minimum, any change in this area would need to include the /proc/vmstat
counters on allocstall and pg*direct* to ensure that direct reclaim stalls are
not worse.
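
For reference, a minimal sketch that dumps exactly those counters,
relying only on the "name value" per-line format of /proc/vmstat (run
before and after a test and diff the output by hand):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* allocstall_* and pgscan/pgsteal_direct* counters */
		if (!strncmp(line, "allocstall", 10) || strstr(line, "_direct"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}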

I'm not a fan of the patch in question because kswapd can be woken between
the low and min watermark without stalling but we really do expect kswapd
to make progress and continue to make progress to avoid future stalls. The
changelog had no information on the before/after impact of the patch and
this is an area where intuition can disagree with real behaviour.

--
Mel Gorman
SUSE Labs

2020-02-21 04:23:23

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> I'm not entirely convinced. The reason the high watermark exists is to have
> kswapd work long enough to make progress without a process having to direct
> reclaim. The most straight-forward example would be a streaming reader of
> a large file. It'll keep pushing the zone towards the low watermark and
> kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> the min watermark is hit and stalls occur. While kswapd could stop at the
> min watermark, it leaves a very short window for kswapd to make enough
> progress before the min watermark is hit.
>
> > At minimum, any change in this area would need to include the /proc/vmstat
> > counters on allocstall and pg*direct* to ensure that direct reclaim stalls are
> not worse.
>
> I'm not a fan of the patch in question because kswapd can be woken between
> the low and min watermark without stalling but we really do expect kswapd
> to make progress and continue to make progress to avoid future stalls. The
> changelog had no information on the before/after impact of the patch and
> this is an area where intuition can disagree with real behaviour.
>
> --
> Mel Gorman
> SUSE Labs

Okay, then let's test real behavior.

I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
command line. After boot up, I opened up chromium and played a video on YouTube.
Immediately after the video started, my laptop completely froze for a few
seconds; then, a few seconds later, my cursor began to respond again, but moving
it around was very laggy. The audio from the video playing was choppy during
this time. About 15-20 seconds after I had started the YouTube video, my system
finally stopped lagging.

Then I tried again with my patch applied (albeit a correct version that doesn't
use the refcount API). Upon starting the same YouTube video in chromium, my
laptop didn't freeze or stutter at all. The cursor was responsive and there was
no stuttering, or choppy audio.

I tested this multiple times with reproducible results each time.

I will attach a functional v2 of the patch that I used.

Sultan

2020-02-21 04:32:34

by Sultan Alsawaf

Subject: [PATCH v2] mm: Stop kswapd early when nothing's waiting for it to free pages

From: Sultan Alsawaf <[email protected]>

Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.

Signed-off-by: Sultan Alsawaf <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 17 ++++++++++++++---
mm/vmscan.c | 3 ++-
3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..23861cdaab7f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -735,6 +735,7 @@ typedef struct pglist_data {
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
+ atomic_t kswapd_waiters;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..923b994c38c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,6 +4401,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int no_progress_loops;
unsigned int cpuset_mems_cookie;
int reserve_flags;
+ pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat;
+ bool woke_kswapd = false;

/*
* We also sanity check to catch abuse of atomic reserves being used by
@@ -4434,8 +4436,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (!ac->preferred_zoneref->zone)
goto nopage;

- if (alloc_flags & ALLOC_KSWAPD)
+ if (alloc_flags & ALLOC_KSWAPD) {
+ if (!woke_kswapd) {
+ atomic_inc(&pgdat->kswapd_waiters);
+ woke_kswapd = true;
+ }
wake_all_kswapds(order, gfp_mask, ac);
+ }

/*
* The adjusted alloc_flags might result in immediate success, so try
@@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
}
fail:
- warn_alloc(gfp_mask, ac->nodemask,
- "page allocation failure: order:%u", order);
got_pg:
+ if (woke_kswapd)
+ atomic_dec(&pgdat->kswapd_waiters);
+ if (!page)
+ warn_alloc(gfp_mask, ac->nodemask,
+ "page allocation failure: order:%u", order);
return page;
}

@@ -6711,6 +6721,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
lruvec_init(&pgdat->__lruvec);
+ pgdat->kswapd_waiters = (atomic_t)ATOMIC_INIT(0);
}

static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..59d9f3dd14f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3694,7 +3694,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
__fs_reclaim_release();
ret = try_to_freeze();
__fs_reclaim_acquire();
- if (ret || kthread_should_stop())
+ if (ret || kthread_should_stop() ||
+ !atomic_read(&pgdat->kswapd_waiters))
break;

/*
--
2.25.1

2020-02-21 08:08:58

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Thu 20-02-20 20:22:32, Sultan Alsawaf wrote:
> On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> > I'm not entirely convinced. The reason the high watermark exists is to have
> > kswapd work long enough to make progress without a process having to direct
> > reclaim. The most straight-forward example would be a streaming reader of
> > a large file. It'll keep pushing the zone towards the low watermark and
> > kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> > the min watermark is hit and stalls occur. While kswapd could stop at the
> > min watermark, it leaves a very short window for kswapd to make enough
> > progress before the min watermark is hit.
> >
> > At minimum, any change in this area would need to include the /proc/vmstat
> > counters on allocstall and pg*direct* to ensure that direct reclaim stalls are
> > not worse.
> >
> > I'm not a fan of the patch in question because kswapd can be woken between
> > the low and min watermark without stalling but we really do expect kswapd
> > to make progress and continue to make progress to avoid future stalls. The
> > changelog had no information on the before/after impact of the patch and
> > this is an area where intuition can disagree with real behaviour.
> >
> > --
> > Mel Gorman
> > SUSE Labs
>
> Okay, then let's test real behavior.
>
> I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
> command line. After boot up, I opened up chromium and played a video on YouTube.
> Immediately after the video started, my laptop completely froze for a few
> seconds; then, a few seconds later, my cursor began to respond again, but moving
> it around was very laggy. The audio from the video playing was choppy during
> this time. About 15-20 seconds after I had started the YouTube video, my system
> finally stopped lagging.

Could you provide regular (e.g. each second) snapshots of /proc/vmstat,
ideally started before and finished after the observed behavior?
Something like
while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 1
done

If you can perf record and see where the kernel spends time during that
time period then it would be really helpful as well.

> Then I tried again with my patch applied (albeit a correct version that doesn't
> use the refcount API). Upon starting the same YouTube video in chromium, my
> laptop didn't freeze or stutter at all. The cursor was responsive and there was
> no stuttering, or choppy audio.
>
> I tested this multiple times with reproducible results each time.

Your patch might be simply papering over a real problem.

> I will attach a functional v2 of the patch that I used.
>
> Sultan

--
Michal Hocko
SUSE Labs

2020-02-21 18:06:15

by Shakeel Butt

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Thu, Feb 20, 2020 at 8:22 PM Sultan Alsawaf <[email protected]> wrote:
>
> On Thu, Feb 20, 2020 at 10:19:45AM +0000, Mel Gorman wrote:
> > I'm not entirely convinced. The reason the high watermark exists is to have
> > kswapd work long enough to make progress without a process having to direct
> > reclaim. The most straight-forward example would be a streaming reader of
> > a large file. It'll keep pushing the zone towards the low watermark and
> > kswapd has to keep ahead of the reader. If we cut kswapd off too quickly,
> > the min watermark is hit and stalls occur. While kswapd could stop at the
> > min watermark, it leaves a very short window for kswapd to make enough
> > progress before the min watermark is hit.
> >
> > At minimum, any change in this area would need to include the /proc/vmstat
> > counters on allocstall and pg*direct* to ensure that direct reclaim stalls are
> > not worse.
> >
> > I'm not a fan of the patch in question because kswapd can be woken between
> > the low and min watermark without stalling but we really do expect kswapd
> > to make progress and continue to make progress to avoid future stalls. The
> > changelog had no information on the before/after impact of the patch and
> > this is an area where intuition can disagree with real behaviour.
> >
> > --
> > Mel Gorman
> > SUSE Labs
>
> Okay, then let's test real behavior.
>
> I fired up my i5-8265U laptop with vanilla linux-next and passed mem=2G on the
> command line. After boot up, I opened up chromium and played a video on YouTube.
> Immediately after the video started, my laptop completely froze for a few
> seconds; then, a few seconds later, my cursor began to respond again, but moving
> it around was very laggy. The audio from the video playing was choppy during
> this time. About 15-20 seconds after I had started the YouTube video, my system
> finally stopped lagging.
>
> Then I tried again with my patch applied (albeit a correct version that doesn't
> use the refcount API). Upon starting the same YouTube video in chromium, my
> laptop didn't freeze or stutter at all. The cursor was responsive and there was
> no stuttering, or choppy audio.
>
> I tested this multiple times with reproducible results each time.
>
> I will attach a functional v2 of the patch that I used.
>

Can you also attach the /proc/zoneinfo? Maybe the watermarks are too high.

2020-02-21 20:02:15

by Sultan Alsawaf

Subject: Re: [PATCH v2] mm: Stop kswapd early when nothing's waiting for it to free pages

On Fri, Feb 21, 2020 at 10:22:02AM -0800, Ira Weiny wrote:
> On Thu, Feb 20, 2020 at 08:30:52PM -0800, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf <[email protected]>
>
> [snip]
>
> > @@ -4640,9 +4647,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > goto retry;
> > }
> > fail:
> > - warn_alloc(gfp_mask, ac->nodemask,
> > - "page allocation failure: order:%u", order);
> > got_pg:
>
> I have no insight into if this is masking a deeper problem or if this fixes
> something but doesn't the above result in 'fail' and 'got_pg' being the same
> label?
>
> Ira
>
> > + if (woke_kswapd)
> > + atomic_dec(&pgdat->kswapd_waiters);
> > + if (!page)
> > + warn_alloc(gfp_mask, ac->nodemask,
> > + "page allocation failure: order:%u", order);
> > return page;
> > }
>
> [snip]

Yes. This was to reduce the patch delta for the initial submission so it's
clearer what's going on; the two labels can of course be coalesced into a
single one. I typically make my patches to upstream components as
non-invasive as possible to aid in backporting and forward porting in case a
patch is rejected and I want to keep it for myself.

Sultan

2020-02-21 20:07:48

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Fri, Feb 21, 2020 at 10:04:38AM -0800, Shakeel Butt wrote:
> Can you also attach the /proc/zoneinfo? Maybe the watermarks are too high.

Here it is (when booted with mem=2G).

Sultan

Node 0, zone DMA
per-node stats
nr_inactive_anon 236
nr_active_anon 47914
nr_inactive_file 44716
nr_active_file 22397
nr_unevictable 11665
nr_slab_reclaimable 11217
nr_slab_unreclaimable 16465
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
nr_anon_pages 47716
nr_mapped 34753
nr_file_pages 79237
nr_dirty 66
nr_writeback 0
nr_writeback_temp 0
nr_shmem 12120
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 3143
nr_written 3073
nr_kernel_misc_reclaimable 0
pages free 3975
min 173
low 216
high 259
spanned 4095
present 3998
managed 3975
protection: (0, 992, 992, 992, 992)
nr_free_pages 3975
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 8
node_unreclaimable: 0
start_pfn: 1
Node 0, zone DMA32
pages free 87925
min 11090
low 13862
high 16634
spanned 294144
present 270018
managed 257874
protection: (0, 0, 0, 0, 0)
nr_free_pages 87925
nr_zone_inactive_anon 236
nr_zone_active_anon 47914
nr_zone_inactive_file 44716
nr_zone_active_file 22397
nr_zone_unevictable 11665
nr_zone_write_pending 66
nr_mlock 16
nr_page_table_pages 1284
nr_kernel_stack 5024
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 544686
numa_miss 0
numa_foreign 0
numa_interleave 8149
numa_local 544686
numa_other 0
pagesets
cpu: 0
count: 358
high: 378
batch: 63
vm stats threshold: 32
cpu: 1
count: 335
high: 378
batch: 63
vm stats threshold: 32
cpu: 2
count: 308
high: 378
batch: 63
vm stats threshold: 32
cpu: 3
count: 337
high: 378
batch: 63
vm stats threshold: 32
cpu: 4
count: 326
high: 378
batch: 63
vm stats threshold: 32
cpu: 5
count: 315
high: 378
batch: 63
vm stats threshold: 32
cpu: 6
count: 370
high: 378
batch: 63
vm stats threshold: 32
cpu: 7
count: 292
high: 378
batch: 63
vm stats threshold: 32
node_unreclaimable: 0
start_pfn: 4096
Node 0, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Movable
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Device
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 0, 0)

2020-02-21 21:25:31

by Dave Hansen

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On 2/21/20 1:08 PM, Sultan Alsawaf wrote:
> Both of these logs are attached in a tarball.

Could you post your vmlinux somewhere? It's hard to do much with
perf.data without it. The other option is to do a 'perf report' on your
system and send the resulting text output.

2020-02-25 09:11:43

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
[...]
> Both of these logs are attached in a tarball.

Thanks! First of all
$ grep pswp vmstat.1582318979
pswpin 0
pswpout 0

suggests that you do not have any swap storage, right? I will get back
to this later. Now, let's have a look at snapshots. We have regular 1s
snapshots initially, but then we have
vmstat.1582318734
vmstat.1582318736
vmstat.1582318758
vmstat.1582318763
vmstat.1582318768
[...]
vmstat.1582318965
vmstat.1582318975
vmstat.1582318976

That is a 242s time period during which even a simple bash script was
struggling to write a snapshot of /proc/vmstat, which by itself
shouldn't really depend on system activity much. Let's have a look at
two randomly chosen consecutive snapshots from this time period:

vmstat.1582318736 vmstat.1582318758
base diff
allocstall_dma 0 0
allocstall_dma32 0 0
allocstall_movable 5773 0
allocstall_normal 906 0

to my surprise there was no invocation of direct reclaim in this time
period. I would have expected some considering the long stall. But the
source of the stall might be something other than direct reclaim.

compact_stall 13 1

Direct compaction has been invoked but this shouldn't cause a major
stall for all processes.

nr_active_anon 133932 236
nr_inactive_anon 9350 -1179
nr_active_file 318 190
nr_inactive_file 673 56
nr_unevictable 51984 0

The amount of anonymous memory is not really high (~560MB) but file LRU
is _really_ low (~3MB), unevictable list is at ~200MB. That gets us to
~760M of memory which is 74% of the memory. Please note that your mem=2G
setup gives you only 1G of memory in fact (based on the zone_info you
have posted). That is not something unusual but the amount of the page
cache is worrying because I would expect heavy thrashing because most
of the executables are going to require major faults. Anonymous memory
is not swapped out obviously so there is no other option than to refault
constantly.

pgscan_kswapd 64788716 14157035
pgsteal_kswapd 29378868 4393216
pswpin 0 0
pswpout 0 0
workingset_activate 3840226 169674
workingset_refault 29396942 4393013
workingset_restore 2883042 106358

And here we can see it clearly happening. Note how the workingset
refaults (4393013) almost exactly match the number of pages kswapd
stole (4393216) over this interval: nearly everything kswapd reclaims
is immediately refaulted.

I would be really curious whether adding swap space would help some.

Now to your patch and why it helps here. It seems quite obvious that the
only effectively reclaimable memory (page cache) is not going to satisfy
the high watermark target
Node 0, zone DMA32
pages free 87925
min 11090
low 13862
high 16634

kswapd has some feedback mechanism to back off when the zone is
hopeless from the reclaim point of view AFAIR, but it seems to have
failed in this particular situation. It should have deferred to direct
reclaim and eventually triggered the OOM killer. Your patch has worked
around this by bailing out of the kswapd reclaim early, so the part of
the page cache required for the code to make progress stays resident.

The proper fix should, however, check the amount of reclaimable pages
and back off if they cannot meet the target IMO. We cannot rely on the
general reclaimability here because that could really be thrashing.

Thoughts?
--
Michal Hocko
SUSE Labs

2020-02-25 17:27:40

by Sultan Alsawaf

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Tue, Feb 25, 2020 at 10:09:45AM +0100, Michal Hocko wrote:
> On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
> [...]
> > Both of these logs are attached in a tarball.
>
> Thanks! First of all
> $ grep pswp vmstat.1582318979
> pswpin 0
> pswpout 0
>
> suggests that you do not have any swap storage, right?

Correct. I'm not using any swap (and it should not be necessary to make Linux mm
work of course). If I were to divide my RAM in half and use one half as swap,
do you think the results would be different? IMO they shouldn't be.

> The amount of anonymous memory is not really high (~560MB) but file LRU
> is _really_ low (~3MB), unevictable list is at ~200MB. That gets us to
> ~760M of memory which is 74% of the memory. Please note that your mem=2G
> setup gives you only 1G of memory in fact (based on the zone_info you
> have posted). That is not something unusual but the amount of the page
> cache is worrying because I would expect heavy thrashing because most
> of the executables are going to require major faults. Anonymous memory
> is not swapped out obviously so there is no other option than to refault
> constantly.

I noticed that only 1G was available as well. Perhaps direct reclaim wasn't
attempted due to the zone_reclaimable_pages() check, though I don't think direct
reclaim would've been particularly helpful in this case (see below).

> kswapd has some feedback mechanism to back off when the zone is hopless
> from the reclaim point of view AFAIR but it seems it has failed in this
> particular situation. It should have relied on the direct reclaim and
> eventually trigger the OOM killer. Your patch has worked around this by
> bailing out from the kswapd reclaim too early so a part of the page
> cache required for the code to move on would stay resident and move
> further.
>
> The proper fix should, however, check the amount of reclaimable pages
> and back off if they cannot meet the target IMO. We cannot rely on the
> general reclaimability here because that could really be thrashing.

Yes, my guess was that thrashing out pages used by the running programs was the
cause for my freezes, but I didn't think of making kswapd back off a different
way.

Right now I don't see any such back-off mechanism in kswapd. Also, if we add
this into kswapd, we would need to plug it into the direct reclaim path as well,
no? I don't think direct reclaim would help with the situation I've run into;
although it wouldn't be as bad as letting kswapd evict pages to the high
watermark, it would still cause page thrashing that would just be capped to the
amount of pages a direct reclaimer is looking to steal.

Considering that my patch remedies this issue for me without invoking the OOM
killer, a proper solution should produce the same or better results. I don't
think the OOM killer should have been triggered in this case.

Sultan

2020-02-25 22:31:42

by Shakeel Butt

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <[email protected]> wrote:
>
[snip]
>
> The proper fix should, however, check the amount of reclaimable pages
> and back off if they cannot meet the target IMO. We cannot rely on the
> general reclaimability here because that could really be thrashing.
>

"check the amount of reclaimable pages" vs "cannot rely on the general
reclaimability"? Can you clarify?

BTW we are seeing a similar situation in our production environment.
We have swappiness=0, no swap from kswapd (because we don't swapout on
pressure, only on cold age) and too few file pages, the kswapd goes
crazy on shrink_slab and spends 100% cpu on it.

Shakeel

2020-02-26 09:06:20

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Tue 25-02-20 09:12:42, Sultan Alsawaf wrote:
> On Tue, Feb 25, 2020 at 10:09:45AM +0100, Michal Hocko wrote:
> > On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
> > [...]
> > > Both of these logs are attached in a tarball.
> >
> > Thanks! First of all
> > $ grep pswp vmstat.1582318979
> > pswpin 0
> > pswpout 0
> >
> > suggests that you do not have any swap storage, right?
>
> Correct. I'm not using any swap (and it should not be necessary to make Linux mm
> work of course).

In your particular situation you simply overcommit the memory way too
much AFAICS. Look at the vmstat data
vmstat.1582318758
nr_free_pages 16292
nr_page_table_pages 2822
nr_inactive_anon 8171
nr_active_anon 134168
nr_inactive_file 729
nr_active_file 508
nr_unevictable 51984
nr_slab_reclaimable 10919
nr_slab_unreclaimable 19766

This is roughly memory that we have clear accounting for (well except
for hugetlb pages but you do not seem to be using those). This is
82% of the memory. The rest is used by different kernel subsystems.
Of that only 47MB is reclaimable without invoking the OOM killer (if we
include nr_slab_reclaimable which might be quite hard to reclaim
effectively). I can imagine that could work only if the page cache
footprint was negligible but that doesn't seem to be the case here.
If you add swap storage then the math is completely different so yes the
swap is likely going to make a difference here.

> If I were to divide my RAM in half and use one half as swap,
> do you think the results would be different? IMO they shouldn't be.

That really depends on what is the swap backed memory footprint.

> > The amount of anonymous memory is not really high (~560MB) but file LRU
> > is _really_ low (~3MB), unevictable list is at ~200MB. That gets us to
> > ~760M of memory which is 74% of the memory. Please note that your mem=2G
> > setup gives you only 1G of memory in fact (based on the zone_info you
> > have posted). That is not something unusual but the amount of the page
> > cache is worrying because I would expect heavy thrashing because most
> > of the executables are going to require major faults. Anonymous memory
> > is not swapped out obviously so there is no other option than to refault
> > constantly.
>
> I noticed that only 1G was available as well.

This is because of your physical memory layout. mem=2G doesn't restrict
the amount of memory to 2G but rather the top physical address to 2G.
If you have holes in the memory layout you are likely to get much less
memory; indeed the zoneinfo you posted shows only 3975 + 257874 managed
pages, i.e. roughly 1G.

> Perhaps direct reclaim wasn't
> attempted due to the zone_reclaimable_pages() check, though I don't think direct
> reclaim would've been particularly helpful in this case (see below).

there are reclaimable pages so zone_reclaimable_pages shouldn't really
stop DR for the zone Normal.

> > kswapd has some feedback mechanism to back off when the zone is hopless
> > from the reclaim point of view AFAIR but it seems it has failed in this
> > particular situation. It should have relied on the direct reclaim and
> > eventually trigger the OOM killer. Your patch has worked around this by
> > bailing out from the kswapd reclaim too early so a part of the page
> > cache required for the code to move on would stay resident and move
> > further.
> >
> > The proper fix should, however, check the amount of reclaimable pages
> > and back off if they cannot meet the target IMO. We cannot rely on the
> > general reclaimability here because that could really be thrashing.
>
> Yes, my guess was that thrashing out pages used by the running programs was the
> cause for my freezes, but I didn't think of making kswapd back off a different
> way.
>
> Right now I don't see any such back-off mechanism in kswapd.

There is a pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES check in
prepare_kswapd_sleep(), but it is not really helping here, quite the
contrary, because kswapd is able to reclaim page cache. The problem is
that we even try to reclaim that page cache, I believe. If we have a
signal that all the reclaimed memory is essentially refaulted then we
should back off.

A simpler and probably much more subtle solution would be to back off
kswapd when zone_reclaimable_pages() < high_wmark_pages. This would push
out the work to the direct reclaim which itself is changing the overall
timing as the reclaim would be more bound to the memory demand.
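
A rough, untested sketch of that back-off check, using the existing
zone_reclaimable_pages() and high_wmark_pages() helpers (the helper
name here is hypothetical, and how exactly balance_pgdat() would
consume the result is left open):

static bool pgdat_reclaim_hopeless(pg_data_t *pgdat, int classzone_idx)
{
	int i;

	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!managed_zone(zone))
			continue;

		/* Enough reclaimable memory left to meet the target? */
		if (zone_reclaimable_pages(zone) >= high_wmark_pages(zone))
			return false;
	}

	return true;
}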

> Also, if we add
> this into kswapd, we would need to plug it into the direct reclaim path as well,
> no? I don't think direct reclaim would help with the situation I've run into;
> although it wouldn't be as bad as letting kswapd evict pages to the high
> watermark, it would still cause page thrashing that would just be capped to the
> amount of pages a direct reclaimer is looking to steal.

And more importantly it would be bound to the allocating context so the
feedback mechanism would be more bound to the workload.

> Considering that my patch remedies this issue for me without invoking the OOM
> killer, a proper solution should produce the same or better results. I don't
> think the OOM killer should have been triggered in this case.

Your system is likely struggling already; it is just less visible
because kswapd reclaim is essentially faster than the paging in, so a
single access might trigger several refaults.
--
Michal Hocko
SUSE Labs

2020-02-26 09:10:33

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <[email protected]> wrote:
> >
> [snip]
> >
> > The proper fix should, however, check the amount of reclaimable pages
> > and back off if they cannot meet the target IMO. We cannot rely on the
> > general reclaimability here because that could really be thrashing.
> >
>
> "check the amount of reclaimable pages" vs "cannot rely on the general
> reclaimability"? Can you clarify?

kswapd targets the high watermark and if your reclaimable memory (aka
zone_reclaimable_pages) is lower than the high wmark then it cannot
simply satisfy that target, right? Keeping up the reclaim in that
situation seems counterproductive to me because you keep evicting pages that
might be reused without any feedback mechanism on the actual usage.
Please see my other reply.

> BTW we are seeing a similar situation in our production environment.
> We have swappiness=0, no swap from kswapd (because we don't swapout on
> pressure, only on cold age) and too few file pages, the kswapd goes
> crazy on shrink_slab and spends 100% cpu on it.

I am not sure this is the same problem. It seems that the slab shrinkers
are not really a bottleneck here. I would recommend you identify
which shrinkers are eating the CPU time in your case.
--
Michal Hocko
SUSE Labs

2020-02-26 17:01:43

by Shakeel Butt

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 26, 2020 at 1:08 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> > On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <[email protected]> wrote:
> > >
> > [snip]
> > >
> > > The proper fix should, however, check the amount of reclaimable pages
> > > and back off if they cannot meet the target IMO. We cannot rely on the
> > > general reclaimability here because that could really be thrashing.
> > >
> >
> > "check the amount of reclaimable pages" vs "cannot rely on the general
> > reclaimability"? Can you clarify?
>
> kswapd targets the high watermark and if your reclaimable memory (aka
> zone_reclaimable_pages) is lower than the high wmark then it cannot
> simply satisfy that target, right? Keeping up the reclaim in that
> situation seems counterproductive to me because you keep evicting pages that
> might be reused without any feedback mechanism on the actual usage.
> Please see my other reply.
>

I understand and agree with the argument that if reclaimable pages are
less than high wmark then no need to reclaim. Regarding not depending
on general reclaimability, I thought you meant that even if
reclaimable pages are over high wmark, we might not want to continue
the reclaim to not cause thrashing. Is that right?

> > BTW we are seeing a similar situation in our production environment.
> > We have swappiness=0, no swap from kswapd (because we don't swapout on
> > pressure, only on cold age) and too few file pages, the kswapd goes
> > crazy on shrink_slab and spends 100% cpu on it.
>
> I am not sure this is the same problem. It seems that the slab shrinkers
> are not really a bottleneck here. I would recommend you identify
> which shrinkers are eating the CPU time in your case.
>

The perf profile shows that the kswapd is spending almost all its time
in list_lru_count_one and memcg tree traversal. So, it's not just one
shrinker.

Yes, it's not exactly the same problem but I would say it is similar.
For Sultan's issue, even if there are many reclaimable pages, we don't
want to thrash. In this issue, thrashing is not happening but kswapd
is going nuts on slab shrinkers.

Shakeel

2020-02-26 17:05:13

by Shakeel Butt

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed, Feb 26, 2020 at 4:15 AM Hillf Danton <[email protected]> wrote:
>
>
> On Tue, 25 Feb 2020 14:30:03 -0800 Shakeel Butt wrote:
> >
> > BTW we are seeing a similar situation in our production environment.
> > We have swappiness=0, no swap from kswapd (because we don't swapout on
> > pressure, only on cold age) and too few file pages, the kswapd goes
> > crazy on shrink_slab and spends 100% cpu on it.
>
> Dunno if swappiness is able to pacify your kswapd.
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2631,8 +2631,14 @@ static inline bool should_continue_reclaim
> */
> pages_for_compaction = compact_gap(sc->order);
> inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> - if (get_nr_swap_pages() > 0)
> - inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> + do {
> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> + int swappiness = mem_cgroup_swappiness(memcg);
> +
> + if (swappiness && get_nr_swap_pages() > 0)

Thanks for finding this. I think we need to check sc->may_swap as
well. Can you please send a signed-off patch? It may or may not help
kswapd, but I think this is needed.

> + inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> + } while (0);
>
> return inactive_lru_pages > pages_for_compaction;
> }
>
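
Folding that in, the check could look something like this (an untested
sketch on top of Hillf's hunk, reusing its lruvec lookup and the
existing helpers it already calls):

	struct lruvec *lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
	int swappiness = mem_cgroup_swappiness(lruvec_memcg(lruvec));

	/* Count anon pages only if this reclaim context may swap at all. */
	if (sc->may_swap && swappiness && get_nr_swap_pages() > 0)
		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);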

2020-02-26 17:42:44

by Michal Hocko

Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

On Wed 26-02-20 09:00:57, Shakeel Butt wrote:
> On Wed, Feb 26, 2020 at 1:08 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 25-02-20 14:30:03, Shakeel Butt wrote:
> > > On Tue, Feb 25, 2020 at 1:10 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > [snip]
> > > >
> > > > The proper fix should, however, check the amount of reclaimable pages
> > > > and back off if they cannot meet the target IMO. We cannot rely on the
> > > > general reclaimability here because that could really be thrashing.
> > > >
> > >
> > > "check the amount of reclaimable pages" vs "cannot rely on the general
> > > reclaimability"? Can you clarify?
> >
> > kswapd targets the high watermark and if your reclaimable memory (aka
> > zone_reclaimable_pages) is lower than the high wmark then it cannot
> > simply satisfy that target, right? Keeping up the reclaim in that
> > situation seems counterproductive to me because you keep evicting pages that
> > might be reused without any feedback mechanism on the actual usage.
> > Please see my other reply.
> >
>
> I understand and agree with the argument that if reclaimable pages are
> less than high wmark then no need to reclaim. Regarding not depending
> on general reclaimability, I thought you meant that even if
> reclaimable pages are over high wmark, we might not want to continue
> the reclaim to not cause thrashing. Is that right?

That is a completely different story. I would stick with the pathological
problem reported here. The general thrashing problem is much more complex
and harder to provide a solution for without introducing a lot of policy
into the reclaim.
--
Michal Hocko
SUSE Labs