2020-10-06 22:47:51

by Dave Hansen

Subject: [RFC][PATCH 00/12] mm: tweak page cache migration

First of all, I think this little slice of code is a bit
under-documented. Perhaps this will help clarify things.

I'm pretty confident that removing the page_count() check in the
first patch is right, which is why I removed it outright. The
xas_load() check is a bit murkier, so I just left a warning in
for it.

Cc: Nicholas Piggin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]


2020-10-06 22:47:51

by Dave Hansen

Subject: [RFC][PATCH 01/12] mm/vmscan: restore zone_reclaim_mode ABI


From: Dave Hansen <[email protected]>

I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
sysctl. Like a good kernel developer, I also went to go update the
documentation. I noticed that the bits in the documentation didn't
match the bits in the #defines.

The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
however, implicitly checked when checking 'node_reclaim_mode==0'.
The RECLAIM_ZONE #define was removed in a cleanup. That, by itself,
is fine.

But, when the bit (bit 0) was removed, the _other_ bit locations also
got changed. That's not OK because the bit values are documented to
mean one specific thing and users surely rely on them meaning that one
thing and not changing from kernel to kernel. The end result is that
if someone had a script that did:

sysctl vm.zone_reclaim_mode=1

this script would have gone from enabling node reclaim for clean
unmapped pages to writing out pages during node reclaim after the
commit in question. That's not great.

Put the bits back the way they were and add a comment so something
like this is a bit harder to do again. Update the documentation to
make it clear that the first bit is ignored.

Signed-off-by: Dave Hansen <[email protected]>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Reviewed-by: Ben Widawsky <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: [email protected]

--

Changes from v2:
* Update description to indicate that bit0 was used for clean
unmapped page node reclaim.
---

b/Documentation/admin-guide/sysctl/vm.rst | 10 +++++-----
b/mm/vmscan.c | 9 +++++++--
2 files changed, 12 insertions(+), 7 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi 2020-10-06 13:39:20.595818443 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst 2020-10-06 13:39:20.601818443 -0700
@@ -976,11 +976,11 @@ that benefit from having their data cach
left disabled as the caching effect is likely to be more important than
data locality.

-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi 2020-10-06 13:39:20.597818443 -0700
+++ b/mm/vmscan.c 2020-10-06 13:39:20.602818443 -0700
@@ -4083,8 +4083,13 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;

-#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */

/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
_

2020-10-06 22:47:52

by Dave Hansen

Subject: [RFC][PATCH 04/12] mm/numa: node demotion data structure and lookup


From: Dave Hansen <[email protected]>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user-defined node migration table. This allows creating a
single migration target for each NUMA node, enabling the kernel to do
NUMA page migrations instead of simply reclaiming colder pages. A
node with no target is a "terminal node", so reclaim acts normally
there. The migration target does not fundamentally _need_ to be a
single node, but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed. This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...). The expectation is that cycles will
never be allowed.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>

--

changes in July 2020:
- Remove loop from next_demotion_node() and get_online_mems().
This means that the node returned by next_demotion_node()
might now be offline, but the worst case is that the
allocation fails. That's fine since it is transient.
---

b/mm/migrate.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path 2020-10-06 13:39:24.067818434 -0700
+++ b/mm/migrate.c 2020-10-06 13:39:24.071818434 -0700
@@ -1161,6 +1161,22 @@ out:
return rc;
}

+static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+ return node_demotion[node];
+}
+
/*
* Obtain the lock on page, remove all ptes and migrate the page
* to the newly allocated page in newpage.
_

Subject: Re: [RFC][PATCH 01/12] mm/vmscan: restore zone_reclaim_mode ABI

On Tue, 6 Oct 2020, Dave Hansen wrote:

> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed. That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel. The end result is that
> if someone had a script that did:

Exactly right. Sorry, I must have missed that when reviewing the patch.

Acked-by: Christoph Lameter <[email protected]>

2020-10-07 09:56:36

by Michal Hocko

Subject: Re: [RFC][PATCH 00/12] mm: tweak page cache migration

Am I the only one missing patch 1-5? lore.k.o doesn't seem to link them
under this message id either.

On Tue 06-10-20 13:51:03, Dave Hansen wrote:
> First of all, I think this little slice of code is a bit
> under-documented. Perhaps this will help clarify things.
>
> I'm pretty confident the page_count() check in the first
> patch is right, which is why I removed it outright. The
> xas_load() check is a bit murkier, so I just left a
> warning in for it.
>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

--
Michal Hocko
SUSE Labs

2020-10-07 09:57:31

by David Hildenbrand

Subject: Re: [RFC][PATCH 00/12] mm: tweak page cache migration

On 07.10.20 11:52, Michal Hocko wrote:
> Am I the only one missing patch 1-5? lore.k.o doesn't seem to link them
> under this message id either.

I received no patches via linux-mm, only the cover letter and Dave's
reply. (maybe some are still in flight ...)

--
Thanks,

David / dhildenb

2020-10-07 16:20:32

by Yang Shi

Subject: Re: [RFC][PATCH 00/12] mm: tweak page cache migration

On Wed, Oct 7, 2020 at 2:55 AM David Hildenbrand <[email protected]> wrote:
>
> On 07.10.20 11:52, Michal Hocko wrote:
> > Am I the only one missing patch 1-5? lore.k.o doesn't seem to link them
> > under this message id either.
>
> I received no patches via linux-mm, only the cover letter and Dave's
> reply. (maybe some are still in flight ...)

Yes, it's exactly the same for me, but anyway I saw the patches via linux-kernel.

And, it seems the github series doesn't reflect the changes made by this series.

>
> --
> Thanks,
>
> David / dhildenb
>
>

2020-10-07 16:23:48

by Dave Hansen

Subject: Re: [RFC][PATCH 00/12] mm: tweak page cache migration

On 10/7/20 8:52 AM, Yang Shi wrote:
> On Wed, Oct 7, 2020 at 2:55 AM David Hildenbrand <[email protected]> wrote:
>> On 07.10.20 11:52, Michal Hocko wrote:
>>> Am I the only one missing patch 1-5? lore.k.o doesn't seem to link them
>>> under this message id either.
>> I received no patches via linux-mm, only the cover letter and Dave's
>> reply. (maybe some are still in flight ...)
> Yes, exactly the same to me, but anyway I saw the patches via linux-kernel.
>
> And, it seems the github series doesn't reflect the changes made by this series.

Sorry about that. I'll try to resend the series.

There have been some Intel->list troubles as of late, but I think I'm
probably to blame for this one.