2023-12-13 01:39:01

by Dan Schatzberg

[permalink] [raw]
Subject: [PATCH V4 0/1] Add swappiness argument to memory.reclaim

Changes since V3:
* Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
* Added explicit calls to mem_cgroup_swappiness

Changes since V2:
* No functional change
* Used int consistently rather than a pointer

Changes since V1:
* Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not allow for any way to effect the
balance of file vs anon during proactive reclaim. The only approach is to adjust
the vm.swappiness setting. However, there are a few reasons we look to control
the balance of file vs anon during proactive reclaim, separately from reactive
reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write endurance.
However, proactive reclaim runs continuously and so its impact on SSD write
endurance is more significant. Therefore it is desireable to control swap-out
for proactive reclaim separately from reactive reclaim

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
exhaustion. This makes sense if the swap exhaustion is triggered due to
reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
one could see OOMs when free memory is ample but anon is just particularly
cold). Therefore, it's desireable to have proactive reclaim reduce or stop
swap-out before the threshold at which OOM killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
writes to memory.reclaim[2]. This has been in production for nearly two years
and has addressed our needs to control proactive vs reactive reclaim behavior
but is still not ideal for a number of reasons:

* vm.swappiness is a global setting, adjusting it can race/interfere with other
system administration that wishes to control vm.swappiness. In our case, we
need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
misconfigured setting. This requires some additional management to record the
"desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman proposed instead an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently Huan also proposed this as well[5] and others
similarly questioned if this was the proper interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this prupose. Proactive reclaimers are un-aware of the relative age of file
vs anon for a cgroup which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need some amount of anon
reclaim attempts separate from the amount of file reclaim attempts which seems
brittle given that it's difficult to observe the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/[email protected]/

Dan Schatzberg (2):
mm: add defines for min/max swappiness
mm: add swapiness= arg to memory.reclaim

Documentation/admin-guide/cgroup-v2.rst | 19 +++++---
include/linux/swap.h | 5 +-
mm/memcontrol.c | 63 ++++++++++++++++++++-----
mm/vmscan.c | 21 +++++----
4 files changed, 80 insertions(+), 28 deletions(-)

--
2.34.1


2023-12-13 01:39:21

by Dan Schatzberg

[permalink] [raw]
Subject: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 19 +++++---
include/linux/swap.h | 3 +-
mm/memcontrol.c | 61 ++++++++++++++++++++-----
mm/vmscan.c | 11 +++--
4 files changed, 72 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..06319349c072 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.

- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
-
Example::

echo "1G" > memory.reclaim

- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.

+The following nested keys are defined.
+
+ ========== ================================
+ swappiness Swappiness value to reclaim with
+ ========== ================================
+
+ Specifying a swappiness value instructs the kernel to perform
+ the reclaim with that swappiness value. Note that this has the
+ same semantics as the vm.swappiness sysctl - it sets the
+ relative IO cost of reclaiming anon vs file memory but does
+ not allow for reclaiming specific amounts of anon or file memory.
+
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..690898c347de 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options);
+ unsigned int reclaim_options,
+ int swappiness);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9748a6b88bb8..be3d5797d9df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
+#include <linux/parser.h>
#include <linux/sched/isolation.h>
#include "internal.h"
#include <net/sock.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
- MEMCG_RECLAIM_MAY_SWAP);
+ MEMCG_RECLAIM_MAY_SWAP,
+ mem_cgroup_swappiness(memcg));
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options);
+ gfp_mask, reclaim_options,
+ mem_cgroup_swappiness(memcg));
psi_memstall_leave(&pflags);

if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3663,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+ mem_cgroup_swappiness(memcg))) {
ret = -EBUSY;
break;
}
@@ -3774,7 +3778,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP))
+ MEMCG_RECLAIM_MAY_SWAP,
+ mem_cgroup_swappiness(memcg)))
nr_retries--;
}

@@ -6720,7 +6725,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}

reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ mem_cgroup_swappiness(memcg));

if (!reclaimed && !nr_retries--)
break;
@@ -6769,7 +6775,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,

if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ mem_cgroup_swappiness(memcg)))
nr_reclaims--;
continue;
}
@@ -6895,6 +6902,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+enum {
+ MEMORY_RECLAIM_SWAPPINESS = 0,
+ MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+ { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+ { MEMORY_RECLAIM_NULL, NULL },
+};
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -6902,12 +6919,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
unsigned int reclaim_options;
- int err;
+ char *old_buf, *start;
+ substring_t args[MAX_OPT_ARGS];
+ int swappiness = mem_cgroup_swappiness(memcg);

buf = strstrip(buf);
- err = page_counter_memparse(buf, "", &nr_to_reclaim);
- if (err)
- return err;
+
+ old_buf = buf;
+ nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+ if (buf == old_buf)
+ return -EINVAL;
+
+ buf = strstrip(buf);
+
+ while ((start = strsep(&buf, " ")) != NULL) {
+ if (!strlen(start))
+ continue;
+ switch (match_token(start, tokens, args)) {
+ case MEMORY_RECLAIM_SWAPPINESS:
+ if (match_int(&args[0], &swappiness))
+ return -EINVAL;
+ if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }

reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6964,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

reclaimed = try_to_free_mem_cgroup_pages(memcg,
min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
- GFP_KERNEL, reclaim_options);
+ GFP_KERNEL, reclaim_options,
+ swappiness);

if (!reclaimed && !nr_retries--)
return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2aa671fe938b..cfef06c7c23f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@ struct scan_control {
/* Always discard instead of demoting to lower tier memory */
unsigned int no_demotion:1;

+ /* Swappiness value for reclaim */
+ int swappiness;
+
/* Allocation order */
s8 order;

@@ -2327,7 +2330,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
- int swappiness = mem_cgroup_swappiness(memcg);
+ int swappiness = sc->swappiness;
u64 fraction[ANON_AND_FILE];
u64 denominator = 0; /* gcc */
enum scan_balance scan_balance;
@@ -2608,7 +2611,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
return 0;

- return mem_cgroup_swappiness(memcg);
+ return sc->swappiness;
}

static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6433,7 +6436,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options)
+ unsigned int reclaim_options,
+ int swappiness)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
@@ -6448,6 +6452,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+ .swappiness = swappiness,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.34.1

2023-12-13 02:06:26

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <[email protected]> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <[email protected]>

NAK.

Please initialize new variables properly and test code changes
thoroughly before submission.

2023-12-13 16:38:53

by Dan Schatzberg

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <[email protected]> wrote:
> >
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> >
> > For example:
> >
> > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> >
> > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > swap) regardless of the vm.swappiness sysctl setting.
> >
> > Signed-off-by: Dan Schatzberg <[email protected]>
>
> NAK.
>
> Please initialize new variables properly and test code changes
> thoroughly before submission.

Could you be a bit more specific? The patch is compiling and working
locally but perhaps there's some configuration or behavior that I
haven't been testing.

2023-12-14 04:29:49

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Wed, Dec 13, 2023 at 8:38 AM Dan Schatzberg <[email protected]> wrote:
>
> On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <[email protected]> wrote:
> > >
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> > >
> > > For example:
> > >
> > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> > >
> > > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > > swap) regardless of the vm.swappiness sysctl setting.
> > >
> > > Signed-off-by: Dan Schatzberg <[email protected]>
> >
> > NAK.
> >
> > Please initialize new variables properly and test code changes
> > thoroughly before submission.
>
> Could you be a bit more specific? The patch is compiling and working
> locally but perhaps there's some configuration or behavior that I
> haven't been testing.

scan_control.swappiness is only initialized from
try_to_free_mem_cgroup_pages(), which means that swappiness is now 0
for all other types of reclaim, which can be a huge problem.

It might be easier to restore a special value (-1, 201, whatever) that
means "use mem_cgroup_swappiness()".

2023-12-14 08:39:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.

You are providing the usecase in the cover letter and Andrew usually
appends that to the first patch in the series. I think it would be
better to have the usecase described here.

[...]
> @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> + ========== ================================
> + swappiness Swappiness value to reclaim with
> + ========== ================================
> +
> + Specifying a swappiness value instructs the kernel to perform
> + the reclaim with that swappiness value. Note that this has the
> + same semantics as the vm.swappiness sysctl - it sets the

same semantics as vm.swappiness applied to memcg reclaim with all the
existing limitations and potential future extensions.

> + relative IO cost of reclaiming anon vs file memory but does
> + not allow for reclaiming specific amounts of anon or file memory.
> +
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.

The biggest problem with the implementation I can see, and others have
pointed out the same, is how fragile this is. You really have to check
the code and _every_ place which defines scan_control to learn that
mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
shrink_all_memory and __node_reclaim. I have only checked couple of
them, like direct reclaim and kswapd and none of them really sets up
swappiness to the default memcg nor global value. So you effectively end
up with swappiness == 0.

While the review can point those out it is quite easy to break and you
will only learn about that very indirectly. I think it would be easier
to review and maintain if you go with a pointer that would fallback to
mem_cgroup_swappiness() if NULL which will be the case for every
existing reclaimer except memory.reclaim with swappiness value.
--
Michal Hocko
SUSE Labs

2023-12-14 18:22:46

by Dan Schatzberg

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
>
> You are providing the usecase in the cover letter and Andrew usually
> appends that to the first patch in the series. I think it would be
> better to have the usecase described here.
>
> [...]
> > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> > This means that the networking layer will not adapt based on
> > reclaim induced by memory.reclaim.
> >
> > +The following nested keys are defined.
> > +
> > + ========== ================================
> > + swappiness Swappiness value to reclaim with
> > + ========== ================================
> > +
> > + Specifying a swappiness value instructs the kernel to perform
> > + the reclaim with that swappiness value. Note that this has the
> > + same semantics as the vm.swappiness sysctl - it sets the
>
> same semantics as vm.swappiness applied to memcg reclaim with all the
> existing limitations and potential future extensions.

Thanks, will make this change.

>
> > + relative IO cost of reclaiming anon vs file memory but does
> > + not allow for reclaiming specific amounts of anon or file memory.
> > +
> > memory.peak
> > A read-only single value file which exists on non-root
> > cgroups.
>
> The biggest problem with the implementation I can see, and others have
> pointed out the same, is how fragile this is. You really have to check
> the code and _every_ place which defines scan_control to learn that
> mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> shrink_all_memory and __node_reclaim. I have only checked couple of
> them, like direct reclaim and kswapd and none of them really sets up
> swappiness to the default memcg nor global value. So you effectively end
> up with swappiness == 0.
>
> While the review can point those out it is quite easy to break and you
> will only learn about that very indirectly. I think it would be easier
> to review and maintain if you go with a pointer that would fallback to
> mem_cgroup_swappiness() if NULL which will be the case for every
> existing reclaimer except memory.reclaim with swappiness value.

I agree. My initial implementation used a pointer for this
reason. I'll switch this back. Just to be clear - I still need to
initialize scan_control.swappiness in all these other places right? It
appears to mostly be stack-initialized which means it has
indeterminate value, not necessarily zero.

2023-12-14 18:28:50

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Thu, Dec 14, 2023 at 10:22 AM Dan Schatzberg
<[email protected]> wrote:
>
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> > On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> >
> > You are providing the usecase in the cover letter and Andrew usually
> > appends that to the first patch in the series. I think it would be
> > better to have the usecase described here.
> >
> > [...]
> > > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> > > This means that the networking layer will not adapt based on
> > > reclaim induced by memory.reclaim.
> > >
> > > +The following nested keys are defined.
> > > +
> > > + ========== ================================
> > > + swappiness Swappiness value to reclaim with
> > > + ========== ================================
> > > +
> > > + Specifying a swappiness value instructs the kernel to perform
> > > + the reclaim with that swappiness value. Note that this has the
> > > + same semantics as the vm.swappiness sysctl - it sets the
> >
> > same semantics as vm.swappiness applied to memcg reclaim with all the
> > existing limitations and potential future extensions.
>
> Thanks, will make this change.
>
> >
> > > + relative IO cost of reclaiming anon vs file memory but does
> > > + not allow for reclaiming specific amounts of anon or file memory.
> > > +
> > > memory.peak
> > > A read-only single value file which exists on non-root
> > > cgroups.
> >
> > The biggest problem with the implementation I can see, and others have
> > pointed out the same, is how fragile this is. You really have to check
> > the code and _every_ place which defines scan_control to learn that
> > mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> > reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> > shrink_all_memory and __node_reclaim. I have only checked couple of
> > them, like direct reclaim and kswapd and none of them really sets up
> > swappiness to the default memcg nor global value. So you effectively end
> > up with swappiness == 0.
> >
> > While the review can point those out it is quite easy to break and you
> > will only learn about that very indirectly. I think it would be easier
> > to review and maintain if you go with a pointer that would fallback to
> > mem_cgroup_swappiness() if NULL which will be the case for every
> > existing reclaimer except memory.reclaim with swappiness value.
>
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right? It
> appears to mostly be stack-initialized which means it has
> indeterminate value, not necessarily zero.

My understanding is that in a partially initialized struct,
uninitialized members default to 0, but I am not a C expert :)

2023-12-15 08:50:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim

On Thu 14-12-23 13:22:29, Dan Schatzberg wrote:
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
[...]
> > While the review can point those out it is quite easy to break and you
> > will only learn about that very indirectly. I think it would be easier
> > to review and maintain if you go with a pointer that would fallback to
> > mem_cgroup_swappiness() if NULL which will be the case for every
> > existing reclaimer except memory.reclaim with swappiness value.
>
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right?

No. They will just get initialized to 0. All you need to make sure is
that the swappiness is used consistently. I would propose something like
scan_control_swappiness() (or sc_swappiness) which would actually do

if (sc->swappiness)
return sc->swappiness;
return mem_cgroup_swappiness(sc->target_mem_cgroup);

and then make sure that mem_cgroup_swappiness is never used directly.

--
Michal Hocko
SUSE Labs