2021-01-26 10:27:44

by Dave Hansen

Subject: [RFC][PATCH 00/13] [v5] Migrate Pages in lieu of discard


The full series is also available here:

https://github.com/hansendc/linux/tree/automigrate-20210122

The meat of this series is in:

[PATCH 08/13] mm/migrate: demote pages during reclaim

That patch also has the most changes since the last post. This
version mostly addresses review comments from Yang Shi and Oscar
Salvador; the comments are documented in the individual patch
changelogs.

This also contains a few prerequisite patches that fix up an issue
with the vm.zone_reclaim_mode sysctl ABI.

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties. First, the newer allocations can end
up in the slower persistent memory. Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.
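
As a rough sketch of that flow (reconstructed from the hunks visible
in patches 12 and 13 below, not the actual patch 08 implementation;
reclaim_with_demotion() and demote_page_list() are hypothetical names
for the real shrink_page_list() hook and its batched migrate_pages()
step):

	static void reclaim_with_demotion(struct list_head *page_list,
					  struct scan_control *sc)
	{
		LIST_HEAD(demote_pages);	/* pages diverted for demotion */
		struct page *page, *next;

		list_for_each_entry_safe(page, next, page_list, lru) {
			/*
			 * Instead of discarding a reclaimable page, queue
			 * it for migration to the next-slower memory tier.
			 */
			if (migrate_demote_page_ok(page, sc)) {
				list_move(&page->lru, &demote_pages);
				continue;
			}
			/* ... normal pageout/discard path ... */
		}

		/* Migrate the whole batch to this node's demotion target. */
		if (!list_empty(&demote_pages))
			demote_page_list(&demote_pages, sc);
	}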

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect. It "strands" pages in slower memory and never
brings them back to fast DRAM. Other things need to be built to
promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/[email protected]

== Open Issues ==

* For cpusets and memory policies that restrict allocations
to PMEM, is it OK to demote to PMEM? Do we need a cgroup-
level API to opt-in or opt-out of these migrations?

--

Changes since (automigrate-20200818):
* Fall back to normal reclaim when demotion fails
* Fix some compile issues when page migration and NUMA are off

Changes since (automigrate-20201007):
* separate out checks for "can scan anon LRU" from "can actually
swap anon pages right now". The previous series conflated them
and may have been overly aggressive in scanning the LRU
* add MR_DEMOTION to tracepoint header
* remove unnecessary hugetlb page check

Changes since (https://lwn.net/Articles/824830/):
* Use higher-level migrate_pages() API approach from Yang Shi's
earlier patches.
* made sure to actually check node_reclaim_mode's new bit
* disabled migration entirely before introducing RECLAIM_MIGRATE
* Replace GFP_NOWAIT with explicit __GFP_KSWAPD_RECLAIM and
comment why we want that.
* Comment on the effects of the changes that keep multiple source
nodes from sharing target nodes

Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: osalvador <[email protected]>
Cc: Huang Ying <[email protected]>


2021-01-26 10:28:15

by Dave Hansen

Subject: [RFC][PATCH 02/13] mm/vmscan: move RECLAIM* bits to uapi header


From: Dave Hansen <[email protected]>

It is currently not obvious that the RECLAIM_* bits are part of the
uapi since they are defined in vmscan.c. Move them to a uapi header
to make it obvious.

This should have no functional impact.
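
As an aside (not part of the patch), once the bits live in the uapi
header a userspace program can decode the sysctl symbolically. A
minimal sketch, assuming installed kernel headers with this patch
applied:

	#include <stdio.h>
	#include <linux/mempolicy.h>	/* RECLAIM_* live here after this patch */

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
		int mode = 0;

		if (!f) {
			perror("zone_reclaim_mode");
			return 1;
		}
		if (fscanf(f, "%d", &mode) != 1)
			mode = 0;
		fclose(f);

		printf("reclaim zone:  %s\n", mode & RECLAIM_ZONE  ? "on" : "off");
		printf("reclaim write: %s\n", mode & RECLAIM_WRITE ? "on" : "off");
		printf("reclaim unmap: %s\n", mode & RECLAIM_UNMAP ? "on" : "off");
		return 0;
	}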

Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: osalvador <[email protected]>

--

Note: This is not cc'd to stable. It does not fix any bugs.
---

b/include/uapi/linux/mempolicy.h | 7 +++++++
b/mm/vmscan.c | 8 --------
2 files changed, 7 insertions(+), 8 deletions(-)

diff -puN include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi 2021-01-25 16:23:07.197866715 -0800
+++ b/include/uapi/linux/mempolicy.h 2021-01-25 16:23:07.203866715 -0800
@@ -62,5 +62,12 @@ enum {
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */

+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */

#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi 2021-01-25 16:23:07.199866715 -0800
+++ b/mm/vmscan.c 2021-01-25 16:23:07.204866715 -0800
@@ -4087,14 +4087,6 @@ module_init(kswapd_init)
int node_reclaim_mode __read_mostly;

/*
- * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
- * ABI. New bits are OK, but existing bits can never change.
- */
-#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
-
-/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
* of a node considered for each zone_reclaim. 4 scans 1/16th of
* a zone.
_

2021-01-26 10:29:21

by Dave Hansen

Subject: [RFC][PATCH 10/13] mm/vmscan: add helper for querying ability to age anonymous pages


From: Dave Hansen <[email protected]>

Anonymous pages are kept on their own LRU(s). These lists could theoretically
always be scanned and maintained. But, without swap, there is currently
nothing the kernel can *do* with the results of a scanned, sorted LRU for
anonymous pages.

A check for '!total_swap_pages' currently serves as a valid test of
whether anonymous LRUs should be maintained. However, another
reclaim method will be added shortly: page demotion.

Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.

Signed-off-by: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: osalvador <[email protected]>
---

b/mm/vmscan.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged 2021-01-25 16:23:17.044866690 -0800
+++ b/mm/vmscan.c 2021-01-25 16:23:17.053866690 -0800
@@ -2508,6 +2508,26 @@ out:
}
}

+/*
+ * Anonymous LRU management is a waste if there is
+ * ultimately no way to reclaim the memory.
+ */
+bool anon_should_be_aged(struct lruvec *lruvec)
+{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+ /* Aging the anon LRU is valuable if swap is present: */
+ if (total_swap_pages > 0)
+ return true;
+
+ /* Also valuable if anon pages can be demoted: */
+ if (next_demotion_node(pgdat->node_id) >= 0)
+ return true;
+
+ /* No way to reclaim anon pages. Should not age anon LRUs: */
+ return false;
+}
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
@@ -2617,7 +2637,8 @@ static void shrink_lruvec(struct lruvec
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+ if (anon_should_be_aged(lruvec) &&
+ inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}
@@ -3446,10 +3467,11 @@ static void age_active_anon(struct pglis
struct mem_cgroup *memcg;
struct lruvec *lruvec;

- if (!total_swap_pages)
+ lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+ if (!anon_should_be_aged(lruvec))
return;

- lruvec = mem_cgroup_lruvec(NULL, pgdat);
if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
return;

_

2021-01-26 12:37:17

by Dave Hansen

Subject: [RFC][PATCH 12/13] mm/vmscan: never demote for memcg reclaim


From: Dave Hansen <[email protected]>

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes. Migrating pages to another
node serves this purpose.

memcg reclaim is different. Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes. Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.

Signed-off-by: Dave Hansen <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: osalvador <[email protected]>
---

b/mm/vmscan.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim 2021-01-25 16:23:19.180866685 -0800
+++ b/mm/vmscan.c 2021-01-25 16:23:19.185866685 -0800
@@ -290,7 +290,8 @@ static bool writeback_throttling_sane(st
#endif

static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
- int node_id)
+ int node_id,
+ struct scan_control *sc)
{
if (memcg == NULL) {
/*
@@ -328,7 +329,7 @@ unsigned long zone_reclaimable_pages(str

nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+ if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

@@ -1065,7 +1066,8 @@ static enum page_references page_check_r
return PAGEREF_RECLAIM;
}

-static bool migrate_demote_page_ok(struct page *page)
+static bool migrate_demote_page_ok(struct page *page,
+ struct scan_control *sc)
{
int next_nid = next_demotion_node(page_to_nid(page));

@@ -1073,6 +1075,10 @@ static bool migrate_demote_page_ok(struc
VM_BUG_ON_PAGE(PageHuge(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

+ /* It is pointless to do demotion in memcg reclaim */
+ if (cgroup_reclaim(sc))
+ return false;
+
if (next_nid == NUMA_NO_NODE)
return false;
if (PageTransHuge(page) && !thp_migration_supported())
@@ -1329,7 +1335,7 @@ retry:
* Before reclaiming the page, try to relocate
* its contents to another node.
*/
- if (do_demote_pass && migrate_demote_page_ok(page)) {
+ if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
list_add(&page->lru, &demote_pages);
unlock_page(page);
continue;
@@ -2362,7 +2368,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;

/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+ if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2737,7 +2743,7 @@ static inline bool should_continue_recla
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+ if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);

return inactive_lru_pages > pages_for_compaction;
_

2021-01-26 18:13:59

by Dave Hansen

Subject: [RFC][PATCH 01/13] mm/vmscan: restore zone_reclaim_mode ABI


From: Dave Hansen <[email protected]>

I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
sysctl. Like a good kernel developer, I also went to go update the
documentation. I noticed that the bits in the documentation didn't
match the bits in the #defines.

The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
however, implicitly checked when checking 'node_reclaim_mode == 0'.
The RECLAIM_ZONE #define was removed in a cleanup. That, by itself,
is fine.

But, when the bit was removed (bit 0) the _other_ bit locations also
got changed. That's not OK because the bit values are documented to
mean one specific thing and users surely rely on them meaning that one
thing and not changing from kernel to kernel. The end result is that
if someone had a script that did:

sysctl vm.zone_reclaim_mode=1

This script would have gone from enabling node reclaim for clean
unmapped pages to writing out pages during node reclaim after the
commit in question. That's not great.

Put the bits back the way they were and add a comment so something
like this is a bit harder to do again. Update the documentation to
make it clear that the first bit is ignored.
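
For reference, the two bit layouts side by side (values as they
appear in the diff below):

	Documented ABI (restored here):   RECLAIM_ZONE  = 1 (bit 0)
	                                  RECLAIM_WRITE = 2 (bit 1)
	                                  RECLAIM_UNMAP = 4 (bit 2)
	After 648b5cf368e0 (broken):      RECLAIM_WRITE = 1 (bit 0)
	                                  RECLAIM_UNMAP = 2 (bit 1)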

Signed-off-by: Dave Hansen <[email protected]>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Reviewed-by: Ben Widawsky <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: osalvador <[email protected]>
Cc: [email protected]

--

Changes from v2:
* Update description to indicate that bit0 was used for clean
unmapped page node reclaim.
---

b/Documentation/admin-guide/sysctl/vm.rst | 10 +++++-----
b/mm/vmscan.c | 9 +++++++--
2 files changed, 12 insertions(+), 7 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-01-25 16:23:06.048866718 -0800
+++ b/Documentation/admin-guide/sysctl/vm.rst 2021-01-25 16:23:06.056866718 -0800
@@ -978,11 +978,11 @@ that benefit from having their data cach
left disabled as the caching effect is likely to be more important than
data locality.

-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-01-25 16:23:06.052866718 -0800
+++ b/mm/vmscan.c 2021-01-25 16:23:06.057866718 -0800
@@ -4086,8 +4086,13 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;

-#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */

/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
_

2021-01-26 18:14:02

by Dave Hansen

Subject: [RFC][PATCH 13/13] mm/migrate: new zone_reclaim_mode to enable reclaim migration


From: Dave Hansen <[email protected]>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold. If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).
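
For example, with this patch applied, demotion alone could be
enabled with:

sysctl vm.zone_reclaim_mode=8

and since the bits OR together, a value of 10 (2|8) would enable
both writeout and migration during node reclaim.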

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: osalvador <[email protected]>
---

b/Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
b/include/linux/swap.h | 3 ++-
b/include/uapi/linux/mempolicy.h | 1 +
b/mm/vmscan.c | 6 ++++--
4 files changed, 16 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 2021-01-25 16:23:43.721866624 -0800
+++ b/Documentation/admin-guide/sysctl/vm.rst 2021-01-25 16:23:43.732866624 -0800
@@ -971,6 +971,7 @@ This is value OR'ed together of
1 Zone reclaim on
2 Zone reclaim writes dirty pages out
4 Zone reclaim swaps pages
+8 Zone reclaim migrates pages
= ===================================

zone_reclaim_mode is disabled by default. For file servers or workloads
@@ -995,3 +996,11 @@ of other processes running on other node
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations. These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances. Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure. This migration is
+performed before swap.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE 2021-01-25 16:23:43.723866624 -0800
+++ b/include/linux/swap.h 2021-01-25 16:23:43.732866624 -0800
@@ -384,7 +384,8 @@ extern int sysctl_min_slab_ratio;
static inline bool node_reclaim_enabled(void)
{
/* Is any node_reclaim_mode bit set? */
- return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+ return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|
+ RECLAIM_UNMAP|RECLAIM_MIGRATE);
}

extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 2021-01-25 16:23:43.725866624 -0800
+++ b/include/uapi/linux/mempolicy.h 2021-01-25 16:23:43.732866624 -0800
@@ -69,5 +69,6 @@ enum {
#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */

#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE 2021-01-25 16:23:43.728866624 -0800
+++ b/mm/vmscan.c 2021-01-25 16:23:43.734866624 -0800
@@ -1075,6 +1075,9 @@ static bool migrate_demote_page_ok(struc
VM_BUG_ON_PAGE(PageHuge(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

+ if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+ return false;
+
/* It is pointless to do demotion in memcg reclaim */
if (cgroup_reclaim(sc))
return false;
@@ -1084,8 +1087,7 @@ static bool migrate_demote_page_ok(struc
if (PageTransHuge(page) && !thp_migration_supported())
return false;

- // FIXME: actually enable this later in the series
- return false;
+ return true;
}


_

2021-01-31 01:16:47

by David Rientjes

Subject: Re: [RFC][PATCH 00/13] [v5] Migrate Pages in lieu of discard

On Mon, 25 Jan 2021, Dave Hansen wrote:

> This also contains a few prerequisite patches that fix up an issue
> with the vm.zone_reclaim_mode sysctl ABI.
>

I think these patches (patches 1-3) can be staged in -mm now since they
fix vm.zone_reclaim_mode correctness and consistency.

Andrew, would it be possible to take patches 1-3 now since they fix
an existing issue?

2021-02-10 09:47:28

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/13] mm/vmscan: restore zone_reclaim_mode ABI

On Mon, Jan 25, 2021 at 04:34:13PM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl. Like a good kernel developer, I also went to go update the
> documentation. I noticed that the bits in the documentation didn't
> match the bits in the #defines.
>
> The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
> however, implicitly checked when checking 'node_reclaim_mode == 0'.
> The RECLAIM_ZONE #define was removed in a cleanup. That, by itself,
> is fine.
>
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed. That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel. The end result is that
> if someone had a script that did:
>
> sysctl vm.zone_reclaim_mode=1
>
> This script would have gone from enabling node reclaim for clean
> unmapped pages to writing out pages during node reclaim after the
> commit in question. That's not great.
>
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again. Update the documentation to
> make it clear that the first bit is ignored.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
> Reviewed-by: Ben Widawsky <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Christoph Lameter <[email protected]>
> Cc: Alex Shi <[email protected]>
> Cc: Daniel Wagner <[email protected]>
> Cc: "Tobin C. Harding" <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Qian Cai <[email protected]>
> Cc: Daniel Wagner <[email protected]>
> Cc: osalvador <[email protected]>
> Cc: [email protected]

Reviewed-by: Oscar Salvador <[email protected]>

>
> --
>
> Changes from v2:
> * Update description to indicate that bit0 was used for clean
> unmapped page node reclaim.
> ---
>
> b/Documentation/admin-guide/sysctl/vm.rst | 10 +++++-----
> b/mm/vmscan.c | 9 +++++++--
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-01-25 16:23:06.048866718 -0800
> +++ b/Documentation/admin-guide/sysctl/vm.rst 2021-01-25 16:23:06.056866718 -0800
> @@ -978,11 +978,11 @@ that benefit from having their data cach
> left disabled as the caching effect is likely to be more important than
> data locality.
>
> -zone_reclaim may be enabled if it's known that the workload is partitioned
> -such that each partition fits within a NUMA node and that accessing remote
> -memory would cause a measurable performance reduction. The page allocator
> -will then reclaim easily reusable pages (those page cache pages that are
> -currently not used) before allocating off node pages.
> +Consider enabling one or more zone_reclaim mode bits if it's known that the
> +workload is partitioned such that each partition fits within a NUMA node
> +and that accessing remote memory would cause a measurable performance
> +reduction. The page allocator will take additional actions before
> +allocating off node pages.
>
> Allowing zone reclaim to write out pages stops processes that are
> writing large amounts of data from dirtying pages on other nodes. Zone
> diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi 2021-01-25 16:23:06.052866718 -0800
> +++ b/mm/vmscan.c 2021-01-25 16:23:06.057866718 -0800
> @@ -4086,8 +4086,13 @@ module_init(kswapd_init)
> */
> int node_reclaim_mode __read_mostly;
>
> -#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
> +/*
> + * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
> + * ABI. New bits are OK, but existing bits can never change.
> + */
> +#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
> +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>
> /*
> * Priority for NODE_RECLAIM. This determines the fraction of pages
> _
>

--
Oscar Salvador
SUSE L3

2021-02-10 09:47:53

by Oscar Salvador

Subject: Re: [RFC][PATCH 02/13] mm/vmscan: move RECLAIM* bits to uapi header

On Mon, Jan 25, 2021 at 04:34:15PM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c. Move them to a uapi header
> to make it obvious.
>
> This should have no functional impact.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Reviewed-by: Ben Widawsky <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Christoph Lameter <[email protected]>
> Cc: Alex Shi <[email protected]>
> Cc: Daniel Wagner <[email protected]>
> Cc: "Tobin C. Harding" <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Qian Cai <[email protected]>
> Cc: Daniel Wagner <[email protected]>
> Cc: osalvador <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3