2018-07-31 11:10:43

by Zhaoyang Huang

[permalink] [raw]
Subject: [PATCH v2] mm: terminate the reclaim early when direct reclaiming

This patch try to let the direct reclaim finish earlier than it used
to be. The problem comes from We observing that the direct reclaim
took a long time to finish when memcg is enabled. By debugging, we
find that the reason is the softlimit is too low to meet the loop
end criteria. So we add two barriers to judge if it has reclaimed
enough memory as same criteria as it is in shrink_lruvec:
1. for each memcg softlimit reclaim.
2. before starting the global reclaim in shrink_zone.

Signed-off-by: Zhaoyang Huang <[email protected]>
---
include/linux/memcontrol.h | 3 ++-
mm/memcontrol.c | 3 +++
mm/vmscan.c | 38 +++++++++++++++++++++++++++++++++++++-
3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb11..a7e82c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,7 +325,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
void mem_cgroup_uncharge_list(struct list_head *page_list);

void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
-
+bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
+ unsigned long nr_scanned, gfp_t gfp_mask, int order);
static struct mem_cgroup_per_node *
mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c0280b..e4efd46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2577,6 +2577,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
(next_mz == NULL ||
loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
break;
+ if (direct_reclaim_reach_watermark(pgdat, nr_reclaimed,
+ *total_scanned, gfp_mask, order))
+ break;
} while (!nr_reclaimed);
if (next_mz)
css_put(&next_mz->memcg->css);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f8..19503f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2518,6 +2518,34 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
(memcg && memcg_congested(pgdat, memcg));
}

+bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
+ unsigned long nr_scanned, gfp_t gfp_mask,
+ int order)
+{
+ struct scan_control sc = {
+ .gfp_mask = gfp_mask,
+ .order = order,
+ .priority = DEF_PRIORITY,
+ .nr_reclaimed = nr_reclaimed,
+ .nr_scanned = nr_scanned,
+ };
+ if (!current_is_kswapd())
+ return false;
+ if (!IS_ENABLED(CONFIG_COMPACTION))
+ return false;
+ /*
+ * In fact, we add 1 to nr_reclaimed and nr_scanned to let should_continue_reclaim
+ * NOT return by finding they are zero, which means compaction_suitable()
+ * takes effect here to judge if we have reclaimed enough pages for passing
+ * the watermark and no necessary to check other memcg anymore.
+ */
+ if (!should_continue_reclaim(pgdat,
+ sc.nr_reclaimed + 1, sc.nr_scanned + 1, &sc))
+ return true;
+ return false;
+}
+EXPORT_SYMBOL(direct_reclaim_reach_watermark);
+
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2802,7 +2830,15 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
sc->nr_scanned += nr_soft_scanned;
/* need some check for avoid more shrink_zone() */
}
-
+ /*
+ * we maybe have stolen enough pages from soft limit reclaim, so we return
+ * back if we are direct reclaim
+ */
+ if (direct_reclaim_reach_watermark(zone->zone_pgdat, sc->nr_reclaimed,
+ sc->nr_scanned, sc->gfp_mask, sc->order)) {
+ sc->gfp_mask = orig_mask;
+ return;
+ }
/* See comment about same check for global reclaim above */
if (zone->zone_pgdat == last_pgdat)
continue;
--
1.9.1



2018-07-31 11:21:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] mm: terminate the reclaim early when direct reclaiming

On Tue 31-07-18 19:09:28, Zhaoyang Huang wrote:
> This patch try to let the direct reclaim finish earlier than it used
> to be. The problem comes from We observing that the direct reclaim
> took a long time to finish when memcg is enabled. By debugging, we
> find that the reason is the softlimit is too low to meet the loop
> end criteria. So we add two barriers to judge if it has reclaimed
> enough memory as same criteria as it is in shrink_lruvec:
> 1. for each memcg softlimit reclaim.
> 2. before starting the global reclaim in shrink_zone.

Then I would really recommend to not use soft limit at all. It has
always been aggressive. I have propose to make it less so in the past we
have decided to go that way because we simply do not know whether
somebody depends on that behavior. Your changelog doesn't really tell
the whole story. Why is this a problem all of the sudden? Nothing has
really changed recently AFAICT. Cgroup v1 interface is mostly for
backward compatibility, we have much better ways to accomplish
workloads isolation in cgroup v2.

So why does it matter all of the sudden?

Besides that EXPORT_SYMBOL for such a low level functionality as the
memory reclaim is a big no-no.

So without a much better explanation and with a low level symbol
exported NAK from me.

>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> include/linux/memcontrol.h | 3 ++-
> mm/memcontrol.c | 3 +++
> mm/vmscan.c | 38 +++++++++++++++++++++++++++++++++++++-
> 3 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb11..a7e82c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -325,7 +325,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> void mem_cgroup_uncharge_list(struct list_head *page_list);
>
> void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
> -
> +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> + unsigned long nr_scanned, gfp_t gfp_mask, int order);
> static struct mem_cgroup_per_node *
> mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8c0280b..e4efd46 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2577,6 +2577,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> (next_mz == NULL ||
> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> break;
> + if (direct_reclaim_reach_watermark(pgdat, nr_reclaimed,
> + *total_scanned, gfp_mask, order))
> + break;
> } while (!nr_reclaimed);
> if (next_mz)
> css_put(&next_mz->memcg->css);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 03822f8..19503f3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2518,6 +2518,34 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> (memcg && memcg_congested(pgdat, memcg));
> }
>
> +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> + unsigned long nr_scanned, gfp_t gfp_mask,
> + int order)
> +{
> + struct scan_control sc = {
> + .gfp_mask = gfp_mask,
> + .order = order,
> + .priority = DEF_PRIORITY,
> + .nr_reclaimed = nr_reclaimed,
> + .nr_scanned = nr_scanned,
> + };
> + if (!current_is_kswapd())
> + return false;
> + if (!IS_ENABLED(CONFIG_COMPACTION))
> + return false;
> + /*
> + * In fact, we add 1 to nr_reclaimed and nr_scanned to let should_continue_reclaim
> + * NOT return by finding they are zero, which means compaction_suitable()
> + * takes effect here to judge if we have reclaimed enough pages for passing
> + * the watermark and no necessary to check other memcg anymore.
> + */
> + if (!should_continue_reclaim(pgdat,
> + sc.nr_reclaimed + 1, sc.nr_scanned + 1, &sc))
> + return true;
> + return false;
> +}
> +EXPORT_SYMBOL(direct_reclaim_reach_watermark);
> +
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> {
> struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -2802,7 +2830,15 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> sc->nr_scanned += nr_soft_scanned;
> /* need some check for avoid more shrink_zone() */
> }
> -
> + /*
> + * we maybe have stolen enough pages from soft limit reclaim, so we return
> + * back if we are direct reclaim
> + */
> + if (direct_reclaim_reach_watermark(zone->zone_pgdat, sc->nr_reclaimed,
> + sc->nr_scanned, sc->gfp_mask, sc->order)) {
> + sc->gfp_mask = orig_mask;
> + return;
> + }
> /* See comment about same check for global reclaim above */
> if (zone->zone_pgdat == last_pgdat)
> continue;
> --
> 1.9.1

--
Michal Hocko
SUSE Labs

2018-07-31 11:59:34

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: [PATCH v2] mm: terminate the reclaim early when direct reclaiming

On Tue, Jul 31, 2018 at 7:19 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 31-07-18 19:09:28, Zhaoyang Huang wrote:
> > This patch try to let the direct reclaim finish earlier than it used
> > to be. The problem comes from We observing that the direct reclaim
> > took a long time to finish when memcg is enabled. By debugging, we
> > find that the reason is the softlimit is too low to meet the loop
> > end criteria. So we add two barriers to judge if it has reclaimed
> > enough memory as same criteria as it is in shrink_lruvec:
> > 1. for each memcg softlimit reclaim.
> > 2. before starting the global reclaim in shrink_zone.
>
> Then I would really recommend to not use soft limit at all. It has
> always been aggressive. I have propose to make it less so in the past we
> have decided to go that way because we simply do not know whether
> somebody depends on that behavior. Your changelog doesn't really tell
> the whole story. Why is this a problem all of the sudden? Nothing has
> really changed recently AFAICT. Cgroup v1 interface is mostly for
> backward compatibility, we have much better ways to accomplish
> workloads isolation in cgroup v2.
>
> So why does it matter all of the sudden?
>
> Besides that EXPORT_SYMBOL for such a low level functionality as the
> memory reclaim is a big no-no.
>
> So without a much better explanation and with a low level symbol
> exported NAK from me.
>
My test workload is from Android system, where the multimedia apps
require much pages. We observed that one thread of the process trapped
into mem_cgroup_soft_limit_reclaim within direct reclaim and also
blocked other thread in mmap or do_page_fault(by semphore?).
Furthermore, we also observed other long time direct reclaim related
with soft limit which are supposed to cause page thrash as the
allocator itself is the most right of the rb_tree . Besides, even
without the soft_limit, shall the 'direct reclaim' check the watermark
firstly before shrink_node, for the concurrent kswapd may have
reclaimed enough pages for allocation.
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
> > ---
> > include/linux/memcontrol.h | 3 ++-
> > mm/memcontrol.c | 3 +++
> > mm/vmscan.c | 38 +++++++++++++++++++++++++++++++++++++-
> > 3 files changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 6c6fb11..a7e82c7 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -325,7 +325,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> > void mem_cgroup_uncharge_list(struct list_head *page_list);
> >
> > void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
> > -
> > +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> > + unsigned long nr_scanned, gfp_t gfp_mask, int order);
> > static struct mem_cgroup_per_node *
> > mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
> > {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 8c0280b..e4efd46 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2577,6 +2577,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> > (next_mz == NULL ||
> > loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> > break;
> > + if (direct_reclaim_reach_watermark(pgdat, nr_reclaimed,
> > + *total_scanned, gfp_mask, order))
> > + break;
> > } while (!nr_reclaimed);
> > if (next_mz)
> > css_put(&next_mz->memcg->css);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 03822f8..19503f3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2518,6 +2518,34 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> > (memcg && memcg_congested(pgdat, memcg));
> > }
> >
> > +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> > + unsigned long nr_scanned, gfp_t gfp_mask,
> > + int order)
> > +{
> > + struct scan_control sc = {
> > + .gfp_mask = gfp_mask,
> > + .order = order,
> > + .priority = DEF_PRIORITY,
> > + .nr_reclaimed = nr_reclaimed,
> > + .nr_scanned = nr_scanned,
> > + };
> > + if (!current_is_kswapd())
> > + return false;
> > + if (!IS_ENABLED(CONFIG_COMPACTION))
> > + return false;
> > + /*
> > + * In fact, we add 1 to nr_reclaimed and nr_scanned to let should_continue_reclaim
> > + * NOT return by finding they are zero, which means compaction_suitable()
> > + * takes effect here to judge if we have reclaimed enough pages for passing
> > + * the watermark and no necessary to check other memcg anymore.
> > + */
> > + if (!should_continue_reclaim(pgdat,
> > + sc.nr_reclaimed + 1, sc.nr_scanned + 1, &sc))
> > + return true;
> > + return false;
> > +}
> > +EXPORT_SYMBOL(direct_reclaim_reach_watermark);
> > +
> > static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > {
> > struct reclaim_state *reclaim_state = current->reclaim_state;
> > @@ -2802,7 +2830,15 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > sc->nr_scanned += nr_soft_scanned;
> > /* need some check for avoid more shrink_zone() */
> > }
> > -
> > + /*
> > + * we maybe have stolen enough pages from soft limit reclaim, so we return
> > + * back if we are direct reclaim
> > + */
> > + if (direct_reclaim_reach_watermark(zone->zone_pgdat, sc->nr_reclaimed,
> > + sc->nr_scanned, sc->gfp_mask, sc->order)) {
> > + sc->gfp_mask = orig_mask;
> > + return;
> > + }
> > /* See comment about same check for global reclaim above */
> > if (zone->zone_pgdat == last_pgdat)
> > continue;
> > --
> > 1.9.1
>
> --
> Michal Hocko
> SUSE Labs

2018-07-31 12:07:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] mm: terminate the reclaim early when direct reclaiming

On Tue 31-07-18 19:58:20, Zhaoyang Huang wrote:
> On Tue, Jul 31, 2018 at 7:19 PM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 31-07-18 19:09:28, Zhaoyang Huang wrote:
> > > This patch try to let the direct reclaim finish earlier than it used
> > > to be. The problem comes from We observing that the direct reclaim
> > > took a long time to finish when memcg is enabled. By debugging, we
> > > find that the reason is the softlimit is too low to meet the loop
> > > end criteria. So we add two barriers to judge if it has reclaimed
> > > enough memory as same criteria as it is in shrink_lruvec:
> > > 1. for each memcg softlimit reclaim.
> > > 2. before starting the global reclaim in shrink_zone.
> >
> > Then I would really recommend to not use soft limit at all. It has
> > always been aggressive. I have propose to make it less so in the past we
> > have decided to go that way because we simply do not know whether
> > somebody depends on that behavior. Your changelog doesn't really tell
> > the whole story. Why is this a problem all of the sudden? Nothing has
> > really changed recently AFAICT. Cgroup v1 interface is mostly for
> > backward compatibility, we have much better ways to accomplish
> > workloads isolation in cgroup v2.
> >
> > So why does it matter all of the sudden?
> >
> > Besides that EXPORT_SYMBOL for such a low level functionality as the
> > memory reclaim is a big no-no.
> >
> > So without a much better explanation and with a low level symbol
> > exported NAK from me.
> >
> My test workload is from Android system, where the multimedia apps
> require much pages. We observed that one thread of the process trapped
> into mem_cgroup_soft_limit_reclaim within direct reclaim and also
> blocked other thread in mmap or do_page_fault(by semphore?).

This requires a much more specific analysis

> Furthermore, we also observed other long time direct reclaim related
> with soft limit which are supposed to cause page thrash as the
> allocator itself is the most right of the rb_tree .

I do not follow.

> Besides, even
> without the soft_limit, shall the 'direct reclaim' check the watermark
> firstly before shrink_node, for the concurrent kswapd may have
> reclaimed enough pages for allocation.

Yes, but the direct reclaim is also a way to throttle allocation
requests and we want them to do at least some work. Making shortcuts
here can easily backfire and allow somebody to runaway too quickly.
Not that this wouldn't be possible right now but adding more heuristic
is surely tricky and far from trivial.
--
Michal Hocko
SUSE Labs

2018-07-31 17:49:48

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v2] mm: terminate the reclaim early when direct reclaiming

On Tue, Jul 31, 2018 at 07:09:28PM +0800, Zhaoyang Huang wrote:
> This patch try to let the direct reclaim finish earlier than it used
> to be. The problem comes from We observing that the direct reclaim
> took a long time to finish when memcg is enabled. By debugging, we
> find that the reason is the softlimit is too low to meet the loop
> end criteria. So we add two barriers to judge if it has reclaimed
> enough memory as same criteria as it is in shrink_lruvec:
> 1. for each memcg softlimit reclaim.
> 2. before starting the global reclaim in shrink_zone.
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> include/linux/memcontrol.h | 3 ++-
> mm/memcontrol.c | 3 +++
> mm/vmscan.c | 38 +++++++++++++++++++++++++++++++++++++-
> 3 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb11..a7e82c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -325,7 +325,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> void mem_cgroup_uncharge_list(struct list_head *page_list);
>
> void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
> -
> +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> + unsigned long nr_scanned, gfp_t gfp_mask, int order);
> static struct mem_cgroup_per_node *
> mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8c0280b..e4efd46 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2577,6 +2577,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> (next_mz == NULL ||
> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> break;
> + if (direct_reclaim_reach_watermark(pgdat, nr_reclaimed,
> + *total_scanned, gfp_mask, order))
> + break;
> } while (!nr_reclaimed);
> if (next_mz)
> css_put(&next_mz->memcg->css);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 03822f8..19503f3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2518,6 +2518,34 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> (memcg && memcg_congested(pgdat, memcg));
> }
>
> +bool direct_reclaim_reach_watermark(pg_data_t *pgdat, unsigned long nr_reclaimed,
> + unsigned long nr_scanned, gfp_t gfp_mask,
> + int order)
> +{
> + struct scan_control sc = {
> + .gfp_mask = gfp_mask,
> + .order = order,
> + .priority = DEF_PRIORITY,
> + .nr_reclaimed = nr_reclaimed,
> + .nr_scanned = nr_scanned,
> + };
> + if (!current_is_kswapd())
> + return false;
> + if (!IS_ENABLED(CONFIG_COMPACTION))
> + return false;
> + /*
> + * In fact, we add 1 to nr_reclaimed and nr_scanned to let should_continue_reclaim
> + * NOT return by finding they are zero, which means compaction_suitable()
> + * takes effect here to judge if we have reclaimed enough pages for passing
> + * the watermark and no necessary to check other memcg anymore.
> + */
> + if (!should_continue_reclaim(pgdat,
> + sc.nr_reclaimed + 1, sc.nr_scanned + 1, &sc))
> + return true;

As per the previous submission, should_continue_reclaim() is really
not an appropriate function to use. It's for the compaction policy.
You might be able to hack around it with faking the reclaim progress,
but this will fail in subtle ways when people change the compaction
policy in that function in the future.

If you use it only for the watermark check, check it explicitly.

Other than that, I agree with Michal that we need much more data on
this; what the configuration is like, what the memory situation is
like when your problem happens, how long the stalls are, and how your
patch helps with overreclaim etc.