2023-12-20 15:27:59

by Dan Schatzberg

Subject: [PATCH v5 0/2] Add swappiness argument to memory.reclaim

Changes since V4:
* Fixed some initialization bugs by reverting to a pointer for swappiness
* Added some more caveats to the behavior of swappiness in documentation

Changes since V3:
* Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
* Added explicit calls to mem_cgroup_swappiness

Changes since V2:
* No functional change
* Used int consistently rather than a pointer

Changes since V1:
* Added documentation

This patch series augments the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.
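
For example (cgroup path illustrative):

  echo "500M swappiness=10" > /sys/fs/cgroup/workload.slice/memory.reclaim

would attempt to reclaim 500M from that cgroup while strongly preferring file
pages, for just that one reclaim attempt and without modifying the global or
per-memcg swappiness setting.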

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not allow for any way to affect the
balance of file vs anon during proactive reclaim. The only approach is to adjust
the vm.swappiness setting. However, there are a few reasons to control the
balance of file vs anon during proactive reclaim separately from reactive
reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write endurance.
However, proactive reclaim runs continuously and so its impact on SSD write
endurance is more significant. Therefore it is desirable to control swap-out
for proactive reclaim separately from reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
exhaustion. This makes sense if the swap exhaustion is triggered due to
reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
one could see OOMs when free memory is ample but anon is just particularly
cold). Therefore, it's desirable to have proactive reclaim reduce or stop
swap-out before the threshold at which OOM killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
before writes to memory.reclaim[2] (sketched after the list below). This has
been in production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior, but is still not ideal for a number of
reasons:

* vm.swappiness is a global setting; adjusting it can race with or interfere
with other system administration that wishes to control vm.swappiness. In our
case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
misconfigured setting. This requires some additional management to record the
"desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman instead proposed an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently, Huan proposed this as well[5], and others similarly
questioned whether this was the proper interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this purpose. Proactive reclaimers are unaware of the relative age of file
vs anon for a cgroup, which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need to issue some amount of
anon reclaim attempts separately from file reclaim attempts, which seems brittle
given that it's difficult to observe the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/[email protected]/

Dan Schatzberg (2):
mm: add defines for min/max swappiness
mm: add swappiness= arg to memory.reclaim

Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
include/linux/swap.h | 5 ++-
mm/memcontrol.c | 58 ++++++++++++++++++++-----
mm/vmscan.c | 27 ++++++++----
4 files changed, 79 insertions(+), 29 deletions(-)

--
2.39.3



2023-12-20 15:28:15

by Dan Schatzberg

Subject: [PATCH v5 1/2] mm: add defines for min/max swappiness

We use the constants 0 and 200 in a few places in the mm code when
referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
and MAX_SWAPPINESS #defines to improve clarity. There are no functional
changes.

Signed-off-by: Dan Schatzberg <[email protected]>
---
include/linux/swap.h | 2 ++
mm/memcontrol.c | 2 +-
mm/vmscan.c | 14 +++++++-------
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6dd6575b905..e2ab76c25b4a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
#define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MIN_SWAPPINESS 0
+#define MAX_SWAPPINESS 200
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b226090fd906..fbe9f02dd206 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

- if (val > 200)
+ if (val > MAX_SWAPPINESS)
return -EINVAL;

if (!mem_cgroup_is_root(memcg))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9dd8977de5a2..d91963e2d47f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -183,7 +183,7 @@ struct scan_control {
#endif

/*
- * From 0 .. 200. Higher means more swappy.
+ * From 0 .. MAX_SWAPPINESS. Higher means more swappy.
*/
int vm_swappiness = 60;

@@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
ap = swappiness * (total_cost + 1);
ap /= anon_cost + 1;

- fp = (200 - swappiness) * (total_cost + 1);
+ fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
fp /= file_cost + 1;

fraction[0] = ap;
@@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
{
int type, tier;
struct ctrl_pos sp, pv;
- int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+ int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };

/*
* Compare the first tier of anon with that of file to determine which
@@ -4436,7 +4436,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
/*
* Try to make the obvious choice first. When anon and file are both
* available from the same generation, interpret swappiness 1 as file
- * first and 200 as anon first.
+ * first and MAX_SWAPPINESS as anon first.
*/
if (!swappiness)
type = LRU_GEN_FILE;
@@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
type = LRU_GEN_ANON;
else if (swappiness == 1)
type = LRU_GEN_FILE;
- else if (swappiness == 200)
+ else if (swappiness == MAX_SWAPPINESS)
type = LRU_GEN_ANON;
else
type = get_type_to_scan(lruvec, swappiness, &tier);
@@ -5398,9 +5398,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,

lruvec = get_lruvec(memcg, nid);

- if (swappiness < 0)
+ if (swappiness < MIN_SWAPPINESS)
swappiness = get_swappiness(lruvec, sc);
- else if (swappiness > 200)
+ else if (swappiness > MAX_SWAPPINESS)
goto done;

switch (cmd) {
--
2.39.3


2023-12-20 15:35:42

by Dan Schatzberg

Subject: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not allow for any way
to affect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons to control the balance of file vs anon during proactive
reclaim separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously and so its
impact on SSD write endurance is more significant. Therefore it is
desirable to control swap-out for proactive reclaim separately from
reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
swap exhaustion. This makes sense if the swap exhaustion is triggered
due to reactive reclaim but less so if it is triggered due to proactive
reclaim (e.g. one could see OOMs when free memory is ample but anon is
just particularly cold). Therefore, it's desirable to have proactive
reclaim reduce or stop swap-out before the threshold at which OOM
killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race with or
interfere with other system administration that wishes to control
vm.swappiness. In our case, we need to disable Senpai before adjusting
vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
a misconfigured setting. This requires some additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Signed-off-by: Dan Schatzberg <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
include/linux/swap.h | 3 +-
mm/memcontrol.c | 56 ++++++++++++++++++++-----
mm/vmscan.c | 13 +++++-
4 files changed, 69 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..ee42f74e0765 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.

- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
-
Example::

echo "1G" > memory.reclaim

- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.

+The following nested keys are defined.
+
+ ========== ================================
+ swappiness Swappiness value to reclaim with
+ ========== ================================
+
+ Specifying a swappiness value instructs the kernel to perform
+ the reclaim with that swappiness value. Note that this has the
+ same semantics as vm.swappiness applied to memcg reclaim with
+ all the existing limitations and potential future extensions.
+
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..8afdec40efe3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options);
+ unsigned int reclaim_options,
+ int *swappiness);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fbe9f02dd206..6d627a754851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -52,6 +52,7 @@
#include <linux/sort.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
+#include <linux/parser.h>
#include <linux/vmpressure.h>
#include <linux/memremap.h>
#include <linux/mm_inline.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
- MEMCG_RECLAIM_MAY_SWAP);
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options);
+ gfp_mask, reclaim_options, NULL);
psi_memstall_leave(&pflags);

if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
ret = -EBUSY;
break;
}
@@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP))
+ MEMCG_RECLAIM_MAY_SWAP, NULL))
nr_retries--;
}

@@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}

reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);

if (!reclaimed && !nr_retries--)
break;
@@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,

if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
nr_reclaims--;
continue;
}
@@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+enum {
+ MEMORY_RECLAIM_SWAPPINESS = 0,
+ MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+ { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+ { MEMORY_RECLAIM_NULL, NULL },
+};
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
+ int swappiness = -1;
unsigned int reclaim_options;
- int err;
+ char *old_buf, *start;
+ substring_t args[MAX_OPT_ARGS];

buf = strstrip(buf);
- err = page_counter_memparse(buf, "", &nr_to_reclaim);
- if (err)
- return err;
+
+ old_buf = buf;
+ nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+ if (buf == old_buf)
+ return -EINVAL;
+
+ buf = strstrip(buf);
+
+ while ((start = strsep(&buf, " ")) != NULL) {
+ if (!strlen(start))
+ continue;
+ switch (match_token(start, tokens, args)) {
+ case MEMORY_RECLAIM_SWAPPINESS:
+ if (match_int(&args[0], &swappiness))
+ return -EINVAL;
+ if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }

reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

reclaimed = try_to_free_mem_cgroup_pages(memcg,
min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
- GFP_KERNEL, reclaim_options);
+ GFP_KERNEL, reclaim_options,
+ swappiness == -1 ? NULL : &swappiness);

if (!reclaimed && !nr_retries--)
return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d91963e2d47f..aa5666842c49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,6 +92,9 @@ struct scan_control {
unsigned long anon_cost;
unsigned long file_cost;

+ /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
+ int *swappiness;
+
/* Can active folios be deactivated as part of reclaim? */
#define DEACTIVATE_ANON 1
#define DEACTIVATE_FILE 2
@@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
- int swappiness = mem_cgroup_swappiness(memcg);
+ int swappiness = sc->swappiness ?
+ *sc->swappiness : mem_cgroup_swappiness(memcg);
u64 fraction[ANON_AND_FILE];
u64 denominator = 0; /* gcc */
enum scan_balance scan_balance;
@@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
return 0;

+ if (sc->swappiness)
+ return *sc->swappiness;
+
return mem_cgroup_swappiness(memcg);
}

@@ -6463,12 +6470,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options)
+ unsigned int reclaim_options,
+ int *swappiness)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+ .swappiness = swappiness,
.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.reclaim_idx = MAX_NR_ZONES - 1,
--
2.39.3


2023-12-21 02:01:36

by Nhat Pham

Subject: Re: [PATCH v5 1/2] mm: add defines for min/max swappiness

On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <[email protected]> wrote:
>
> We use the constants 0 and 200 in a few places in the mm code when
> referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
> and MAX_SWAPPINESS #defines to improve clarity. There are no functional
> changes.
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> ---
> include/linux/swap.h | 2 ++
> mm/memcontrol.c | 2 +-
> mm/vmscan.c | 14 +++++++-------
> 3 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f6dd6575b905..e2ab76c25b4a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>
> #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
> #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
> +#define MIN_SWAPPINESS 0
> +#define MAX_SWAPPINESS 200
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b226090fd906..fbe9f02dd206 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>
> - if (val > 200)
> + if (val > MAX_SWAPPINESS)
> return -EINVAL;
>
> if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9dd8977de5a2..d91963e2d47f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -183,7 +183,7 @@ struct scan_control {
> #endif
>
> /*
> - * From 0 .. 200. Higher means more swappy.
> + * From 0 .. MAX_SWAPPINESS. Higher means more swappy.
> */
> int vm_swappiness = 60;
>
> @@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> ap = swappiness * (total_cost + 1);
> ap /= anon_cost + 1;
>
> - fp = (200 - swappiness) * (total_cost + 1);
> + fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
> fp /= file_cost + 1;
>
> fraction[0] = ap;
> @@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
> {
> int type, tier;
> struct ctrl_pos sp, pv;
> - int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
> + int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
>
> /*
> * Compare the first tier of anon with that of file to determine which
> @@ -4436,7 +4436,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> /*
> * Try to make the obvious choice first. When anon and file are both
> * available from the same generation, interpret swappiness 1 as file
> - * first and 200 as anon first.
> + * first and MAX_SWAPPINESS as anon first.
> */
> if (!swappiness)
> type = LRU_GEN_FILE;
> @@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> type = LRU_GEN_ANON;
> else if (swappiness == 1)
> type = LRU_GEN_FILE;
> - else if (swappiness == 200)
> + else if (swappiness == MAX_SWAPPINESS)
> type = LRU_GEN_ANON;
> else
> type = get_type_to_scan(lruvec, swappiness, &tier);
> @@ -5398,9 +5398,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
>
> lruvec = get_lruvec(memcg, nid);
>
> - if (swappiness < 0)
> + if (swappiness < MIN_SWAPPINESS)
> swappiness = get_swappiness(lruvec, sc);
> - else if (swappiness > 200)
> + else if (swappiness > MAX_SWAPPINESS)
> goto done;
>
> switch (cmd) {
> --
> 2.39.3
>

Sorry for being late to the party :) Was preoccupied with zswap convos.

Hmm, these are all the occurrences of 0 and 200 (in the context of
swappiness) I can dig out. So, FWIW:
Reviewed-by: Nhat Pham <[email protected]>

2023-12-21 09:53:47

by Michal Hocko

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Wed 20-12-23 07:26:51, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons to control the balance of file vs anon during proactive
> reclaim separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim.
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting; adjusting it can race with or
> interfere with other system administration that wishes to control
> vm.swappiness. In our case, we need to disable Senpai before adjusting
> vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.

Thank you for extending the changelog with use cases!

> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 56 ++++++++++++++++++++-----
> mm/vmscan.c | 13 +++++-
> 4 files changed, 69 insertions(+), 21 deletions(-)

LGTM
Acked-by: Michal Hocko <[email protected]>

Just one minor thing. It would be really great to prevent potential
incorrect use of mem_cgroup_swappiness. This should be an internal
function to memcg. Now, having scan_control internal to vmscan.c makes
that harder, and moving it out to swap.h or internal.h sounds
overreaching.

We could do this at least to reduce those mistakes. I can make it a
proper patch if this seems reasonable or you can fold it into your patch
directly.
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f98dff23b758..5f3a182e9515 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,8 +92,10 @@ struct scan_control {
unsigned long anon_cost;
unsigned long file_cost;

- /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
+#ifdef CONFIG_MEMCG
+ /* Swappiness value for reclaim. Always use sc_swappiness()! */
int *swappiness;
+#endif

/* Can active folios be deactivated as part of reclaim? */
#define DEACTIVATE_ANON 1
@@ -230,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
#endif
return false;
}
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+ if (sc->swappiness)
+ return *sc->swappiness;
+ return mem_cgroup_swappiness(memcg);
+}
#else
static bool cgroup_reclaim(struct scan_control *sc)
{
@@ -245,6 +254,10 @@ static bool writeback_throttling_sane(struct scan_control *sc)
{
return true;
}
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+ return READ_ONCE(vm_swappiness);
+}
#endif

static void set_task_reclaim_state(struct task_struct *task,
@@ -2330,8 +2343,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
- int swappiness = sc->swappiness ?
- *sc->swappiness : mem_cgroup_swappiness(memcg);
+ int swappiness = sc_swappiness(sc, memcg);
u64 fraction[ANON_AND_FILE];
u64 denominator = 0; /* gcc */
enum scan_balance scan_balance;
@@ -2612,10 +2624,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
return 0;

- if (sc->swappiness)
- return *sc->swappiness;
-
- return mem_cgroup_swappiness(memcg);
+ return sc_swappiness(sc, memcg);
}

static int get_nr_gens(struct lruvec *lruvec, int type)
--
Michal Hocko
SUSE Labs

2023-12-22 04:38:50

by David Rientjes

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Wed, 20 Dec 2023, Dan Schatzberg wrote:

> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons to control the balance of file vs anon during proactive
> reclaim separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim.
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting; adjusting it can race with or
> interfere with other system administration that wishes to control
> vm.swappiness. In our case, we need to disable Senpai before adjusting
> vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>

Acked-by: David Rientjes <[email protected]>

2023-12-22 04:39:37

by David Rientjes

Subject: Re: [PATCH v5 1/2] mm: add defines for min/max swappiness

On Wed, 20 Dec 2023, Dan Schatzberg wrote:

> We use the constants 0 and 200 in a few places in the mm code when
> referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
> and MAX_SWAPPINESS #defines to improve clarity. There are no functional
> changes.
>
> Signed-off-by: Dan Schatzberg <[email protected]>

Acked-by: David Rientjes <[email protected]>

2023-12-22 05:32:48

by Yu Zhao

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <[email protected]> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons to control the balance of file vs anon during proactive
> reclaim separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim.
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting; adjusting it can race with or
> interfere with other system administration that wishes to control
> vm.swappiness. In our case, we need to disable Senpai before adjusting
> vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>

The cover letter says:
"Previously, this exact interface addition was proposed by Yosry[3]."

So I think it should be acknowledged with a Suggested-by, based on:
"A Suggested-by: tag indicates that the patch idea is suggested by the
person named and ensures credit to the person for the idea."
from
https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

> Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 56 ++++++++++++++++++++-----
> mm/vmscan.c | 13 +++++-
> 4 files changed, 69 insertions(+), 21 deletions(-)

...

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..aa5666842c49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,9 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> + int *swappiness;

Using a pointer to indicate whether the type it points to is
overridden isn't really a good practice.

A better alternative was suggested during the v2:
"Perhaps the negative to avoid unnecessary dereferences."
https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/

Since only proactive reclaim can override swappiness, meaning it only
happens if sc->proactive is true, I think the best way to make it work
without spending much effort is to create a helper as Michal suggested,
but it should look like:

static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg);
}

In this patchset, sc->swappiness really means
sc->proactive_swappiness. So it should be renamed accordingly.

2023-12-24 17:15:10

by Chris Li

Subject: Re: [PATCH v5 1/2] mm: add defines for min/max swappiness

Hi Dan,

Acked-by: Chris Li <[email protected]>

Chris

On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <[email protected]> wrote:
>
> We use the constants 0 and 200 in a few places in the mm code when
> referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
> and MAX_SWAPPINESS #defines to improve clarity. There are no functional
> changes.
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> ---
> include/linux/swap.h | 2 ++
> mm/memcontrol.c | 2 +-
> mm/vmscan.c | 14 +++++++-------
> 3 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f6dd6575b905..e2ab76c25b4a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>
> #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
> #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
> +#define MIN_SWAPPINESS 0
> +#define MAX_SWAPPINESS 200
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b226090fd906..fbe9f02dd206 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>
> - if (val > 200)
> + if (val > MAX_SWAPPINESS)
> return -EINVAL;
>
> if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9dd8977de5a2..d91963e2d47f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -183,7 +183,7 @@ struct scan_control {
> #endif
>
> /*
> - * From 0 .. 200. Higher means more swappy.
> + * From 0 .. MAX_SWAPPINESS. Higher means more swappy.
> */
> int vm_swappiness = 60;
>
> @@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> ap = swappiness * (total_cost + 1);
> ap /= anon_cost + 1;
>
> - fp = (200 - swappiness) * (total_cost + 1);
> + fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
> fp /= file_cost + 1;
>
> fraction[0] = ap;
> @@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
> {
> int type, tier;
> struct ctrl_pos sp, pv;
> - int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
> + int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
>
> /*
> * Compare the first tier of anon with that of file to determine which
> @@ -4436,7 +4436,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> /*
> * Try to make the obvious choice first. When anon and file are both
> * available from the same generation, interpret swappiness 1 as file
> - * first and 200 as anon first.
> + * first and MAX_SWAPPINESS as anon first.
> */
> if (!swappiness)
> type = LRU_GEN_FILE;
> @@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> type = LRU_GEN_ANON;
> else if (swappiness == 1)
> type = LRU_GEN_FILE;
> - else if (swappiness == 200)
> + else if (swappiness == MAX_SWAPPINESS)
> type = LRU_GEN_ANON;
> else
> type = get_type_to_scan(lruvec, swappiness, &tier);
> @@ -5398,9 +5398,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
>
> lruvec = get_lruvec(memcg, nid);
>
> - if (swappiness < 0)
> + if (swappiness < MIN_SWAPPINESS)
> swappiness = get_swappiness(lruvec, sc);
> - else if (swappiness > 200)
> + else if (swappiness > MAX_SWAPPINESS)
> goto done;
>
> switch (cmd) {
> --
> 2.39.3
>
>

2023-12-24 17:21:53

by Chris Li

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

Hi Dan,

Thanks for removing the -1 special cases.

Acked-by: Chris Li <[email protected]>

Chris

On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <[email protected]> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons to control the balance of file vs anon during proactive
> reclaim separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim.
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting; adjusting it can race with or
> interfere with other system administration that wishes to control
> vm.swappiness. In our case, we need to disable Senpai before adjusting
> vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 56 ++++++++++++++++++++-----
> mm/vmscan.c | 13 +++++-
> 4 files changed, 69 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..ee42f74e0765 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
> This is a simple interface to trigger memory reclaim in the
> target cgroup.
>
> - This file accepts a single key, the number of bytes to reclaim.
> - No nested keys are currently supported.
> -
> Example::
>
> echo "1G" > memory.reclaim
>
> - The interface can be later extended with nested keys to
> - configure the reclaim behavior. For example, specify the
> - type of memory to reclaim from (anon, file, ..).
> -
> Please note that the kernel can over or under reclaim from
> the target cgroup. If less bytes are reclaimed than the
> specified amount, -EAGAIN is returned.
> @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> + ========== ================================
> + swappiness Swappiness value to reclaim with
> + ========== ================================
> +
> + Specifying a swappiness value instructs the kernel to perform
> + the reclaim with that swappiness value. Note that this has the
> + same semantics as vm.swappiness applied to memcg reclaim with
> + all the existing limitations and potential future extensions.
> +
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e2ab76c25b4a..8afdec40efe3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options);
> + unsigned int reclaim_options,
> + int *swappiness);
> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fbe9f02dd206..6d627a754851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -52,6 +52,7 @@
> #include <linux/sort.h>
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> +#include <linux/parser.h>
> #include <linux/vmpressure.h>
> #include <linux/memremap.h>
> #include <linux/mm_inline.h>
> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> psi_memstall_enter(&pflags);
> nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> gfp_mask,
> - MEMCG_RECLAIM_MAY_SWAP);
> + MEMCG_RECLAIM_MAY_SWAP,
> + NULL);
> psi_memstall_leave(&pflags);
> } while ((memcg = parent_mem_cgroup(memcg)) &&
> !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> psi_memstall_enter(&pflags);
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> - gfp_mask, reclaim_options);
> + gfp_mask, reclaim_options, NULL);
> psi_memstall_leave(&pflags);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> }
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> ret = -EBUSY;
> break;
> }
> @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return -EINTR;
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - MEMCG_RECLAIM_MAY_SWAP))
> + MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_retries--;
> }
>
> @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> }
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>
> if (!reclaimed && !nr_retries--)
> break;
> @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
> if (nr_reclaims) {
> if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_reclaims--;
> continue;
> }
> @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +enum {
> + MEMORY_RECLAIM_SWAPPINESS = 0,
> + MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> + { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + int swappiness = -1;
> unsigned int reclaim_options;
> - int err;
> + char *old_buf, *start;
> + substring_t args[MAX_OPT_ARGS];
>
> buf = strstrip(buf);
> - err = page_counter_memparse(buf, "", &nr_to_reclaim);
> - if (err)
> - return err;
> +
> + old_buf = buf;
> + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> + if (buf == old_buf)
> + return -EINVAL;
> +
> + buf = strstrip(buf);
> +
> + while ((start = strsep(&buf, " ")) != NULL) {
> + if (!strlen(start))
> + continue;
> + switch (match_token(start, tokens, args)) {
> + case MEMORY_RECLAIM_SWAPPINESS:
> + if (match_int(&args[0], &swappiness))
> + return -EINVAL;
> + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
>
> reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> - GFP_KERNEL, reclaim_options);
> + GFP_KERNEL, reclaim_options,
> + swappiness == -1 ? NULL : &swappiness);
>
> if (!reclaimed && !nr_retries--)
> return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..aa5666842c49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,9 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> + int *swappiness;
> +
> /* Can active folios be deactivated as part of reclaim? */
> #define DEACTIVATE_ANON 1
> #define DEACTIVATE_FILE 2
> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> - int swappiness = mem_cgroup_swappiness(memcg);
> + int swappiness = sc->swappiness ?
> + *sc->swappiness : mem_cgroup_swappiness(memcg);
> u64 fraction[ANON_AND_FILE];
> u64 denominator = 0; /* gcc */
> enum scan_balance scan_balance;
> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> return 0;
>
> + if (sc->swappiness)
> + return *sc->swappiness;
> +
> return mem_cgroup_swappiness(memcg);
> }
>
> @@ -6463,12 +6470,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options)
> + unsigned int reclaim_options,
> + int *swappiness)
> {
> unsigned long nr_reclaimed;
> unsigned int noreclaim_flag;
> struct scan_control sc = {
> .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> + .swappiness = swappiness,
> .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> .reclaim_idx = MAX_NR_ZONES - 1,
> --
> 2.39.3
>
>

2024-01-02 15:22:06

by Dan Schatzberg

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

Hi Yu Zhao,

Thanks for the feedback, sorry for the delayed response.

On Thu, Dec 21, 2023 at 10:31:59PM -0700, Yu Zhao wrote:
> On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <[email protected]> wrote:
> >
> > ...
>
> The cover letter says:
> "Previously, this exact interface addition was proposed by Yosry[3]."
>
> So I think it should be acknowledged with a Suggested-by, based on:
> "A Suggested-by: tag indicates that the patch idea is suggested by the
> person named and ensures credit to the person for the idea."
> from
> https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

Sure, will do.

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..aa5666842c49 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,9 @@ struct scan_control {
> > unsigned long anon_cost;
> > unsigned long file_cost;
> >
> > + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> > + int *swappiness;
>
> Using a pointer to indicate whether the type it points to is
> overridden isn't really a good practice.
>
> A better alternative was suggested during the v2:
> "Perhaps the negative to avoid unnecessary dereferences."
> https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/

I did have a couple of versions with a negative sentinel, but it creates
initialization issues where every instantiation of scan_control needs
to make sure to initialize swappiness or else it will behave as if
swappiness is 0. That's pretty error-prone, so using the pointer seemed
the better approach.
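
To make the hazard concrete, here is a userspace toy (not the kernel
code; the structs and the value 60 are simplified stand-ins for
scan_control and mem_cgroup_swappiness()):

#include <stdio.h>

/* Sentinel variant: -1 is supposed to mean "not overridden". */
struct sc_sentinel {
	unsigned long nr_to_reclaim;
	int swappiness;
};

/* Pointer variant: NULL means "not overridden". */
struct sc_ptr {
	unsigned long nr_to_reclaim;
	int *swappiness;
};

static int default_swappiness(void)
{
	return 60; /* stand-in for mem_cgroup_swappiness() */
}

int main(void)
{
	/* Both initializers "forget" swappiness; designated initializers
	 * zero-fill unmentioned fields. */
	struct sc_sentinel a = { .nr_to_reclaim = 32 };
	struct sc_ptr b = { .nr_to_reclaim = 32 };

	/* Sentinel: 0 is a valid swappiness, so the forgotten field
	 * silently means "file-only reclaim" rather than "use default". */
	int eff_a = a.swappiness < 0 ? default_swappiness() : a.swappiness;

	/* Pointer: the zero value (NULL) is the fallback case, so the
	 * same omission still yields the default. */
	int eff_b = b.swappiness ? *b.swappiness : default_swappiness();

	printf("sentinel: %d, pointer: %d\n", eff_a, eff_b); /* 0 vs 60 */
	return 0;
}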

> Since only proactive reclaim can override swappiness, meaning it only
> happens if sc->proactive is true, I think the best way to make it work
> without spending much effort is to create a helper as Michal suggested,
> but it should look like:
>
> static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> {
> 	return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg);
> }
>
> In this patchset, sc->swappiness really means
> sc->proactive_swappiness. So it should be renamed accordingly.

Helper aside, I disagree with this point about coupling with the
proactive flag. The fact that the only user currently is proactive
reclaim doesn't imply to me that the interface (in scan_control)
should be coupled to the use-case. It's easier to reason about a
swappiness field that overrides swappiness for all scans that set it,
regardless of the user.

2024-01-02 17:44:15

by Dan Schatzberg

Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Thu, Dec 21, 2023 at 10:29:59AM +0100, Michal Hocko wrote:
> On Wed 20-12-23 07:26:51, Dan Schatzberg wrote:
> > ...
>
> LGTM
> Acked-by: Michal Hocko <[email protected]>
>
> Just one minor thing. It would be really great to prevent potential
> incorrect use of mem_cgroup_swappiness. This should be an internal
> function to memcg. Now, having scan_control internal to vmscan.c makes
> that harder, and moving it out to swap.h or internal.h sounds
> overreaching.
>
> We could do this at least to reduce those mistakes. I can make it a
> proper patch if this seems reasonable or you can fold it into your patch
> directly.
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f98dff23b758..5f3a182e9515 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,8 +92,10 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> - /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> +#ifdef CONFIG_MEMCG
> + /* Swappiness value for reclaim. Always use sc_swappiness()! */
> int *swappiness;
> +#endif
>
> /* Can active folios be deactivated as part of reclaim? */
> #define DEACTIVATE_ANON 1
> @@ -230,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> #endif
> return false;
> }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> + if (sc->swappiness)
> + return *sc->swappiness;
> + return mem_cgroup_swappiness(memcg);
> +}
> #else
> static bool cgroup_reclaim(struct scan_control *sc)
> {
> @@ -245,6 +254,10 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> {
> return true;
> }
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> + return READ_ONCE(vm_swappiness);
> +}
> #endif
>
> static void set_task_reclaim_state(struct task_struct *task,
> @@ -2330,8 +2343,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> - int swappiness = sc->swappiness ?
> - *sc->swappiness : mem_cgroup_swappiness(memcg);
> + int swappiness = sc_swappiness(sc, memcg);
> u64 fraction[ANON_AND_FILE];
> u64 denominator = 0; /* gcc */
> enum scan_balance scan_balance;
> @@ -2612,10 +2624,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> return 0;
>
> - if (sc->swappiness)
> - return *sc->swappiness;
> -
> - return mem_cgroup_swappiness(memcg);
> + return sc_swappiness(sc, memcg);
> }
>
> static int get_nr_gens(struct lruvec *lruvec, int type)
> --
> Michal Hocko
> SUSE Labs

Thanks for the review Michal, and sorry for the delayed response. Your
patch looks reasonable to me, but I'm a bit unclear about the need for
the #ifdef - mem_cgroup_swappiness() already works correctly regardless
of CONFIG_MEMCG - so why not make sc->swappiness and sc_swappiness()
unconditional?

Happy to roll that into the next version of my patch.
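
Editor's note: for context on the question above, mem_cgroup_swappiness()
already degrades to the global knob on its own. A rough paraphrase of the
include/linux/swap.h helper from kernels of this era (details may differ
slightly):

static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
	/* cgroup v2 has no per-cgroup swappiness knob */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return READ_ONCE(vm_swappiness);

	/* root cgroup, or memcg disabled at boot */
	if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
		return READ_ONCE(vm_swappiness);

	return READ_ONCE(memcg->swappiness);
}

A !CONFIG_MEMCG stub simply returns READ_ONCE(vm_swappiness), which is why
an unconditional sc->swappiness and sc_swappiness() would still behave
sensibly, as Dan suggests.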

2024-01-02 22:42:39

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <[email protected]> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons we look to control the balance of file vs anon during
> proactive reclaim, separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting, adjusting it can race/interfere
> with other system administration that wishes to control vm.swappiness.
> In our case, we need to disable Senpai before adjusting vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 56 ++++++++++++++++++++-----
> mm/vmscan.c | 13 +++++-
> 4 files changed, 69 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..ee42f74e0765 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
> This is a simple interface to trigger memory reclaim in the
> target cgroup.
>
> - This file accepts a single key, the number of bytes to reclaim.
> - No nested keys are currently supported.
> -
> Example::
>
> echo "1G" > memory.reclaim
>
> - The interface can be later extended with nested keys to
> - configure the reclaim behavior. For example, specify the
> - type of memory to reclaim from (anon, file, ..).
> -
> Please note that the kernel can over or under reclaim from
> the target cgroup. If less bytes are reclaimed than the
> specified amount, -EAGAIN is returned.
> @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> + ========== ================================
> + swappiness Swappiness value to reclaim with
> + ========== ================================
> +
> + Specifying a swappiness value instructs the kernel to perform
> + the reclaim with that swappiness value. Note that this has the
> + same semantics as vm.swappiness applied to memcg reclaim with
> + all the existing limitations and potential future extensions.
> +
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e2ab76c25b4a..8afdec40efe3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options);
> + unsigned int reclaim_options,
> + int *swappiness);
> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fbe9f02dd206..6d627a754851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -52,6 +52,7 @@
> #include <linux/sort.h>
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> +#include <linux/parser.h>
> #include <linux/vmpressure.h>
> #include <linux/memremap.h>
> #include <linux/mm_inline.h>
> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> psi_memstall_enter(&pflags);
> nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> gfp_mask,
> - MEMCG_RECLAIM_MAY_SWAP);
> + MEMCG_RECLAIM_MAY_SWAP,
> + NULL);
> psi_memstall_leave(&pflags);
> } while ((memcg = parent_mem_cgroup(memcg)) &&
> !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> psi_memstall_enter(&pflags);
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> - gfp_mask, reclaim_options);
> + gfp_mask, reclaim_options, NULL);
> psi_memstall_leave(&pflags);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> }
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> ret = -EBUSY;
> break;
> }
> @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return -EINTR;
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - MEMCG_RECLAIM_MAY_SWAP))
> + MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_retries--;
> }
>
> @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> }
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>
> if (!reclaimed && !nr_retries--)
> break;
> @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
> if (nr_reclaims) {
> if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_reclaims--;
> continue;
> }
> @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +enum {
> + MEMORY_RECLAIM_SWAPPINESS = 0,
> + MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> + { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + int swappiness = -1;
> unsigned int reclaim_options;
> - int err;
> + char *old_buf, *start;
> + substring_t args[MAX_OPT_ARGS];
>
> buf = strstrip(buf);
> - err = page_counter_memparse(buf, "", &nr_to_reclaim);
> - if (err)
> - return err;
> +
> + old_buf = buf;
> + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> + if (buf == old_buf)
> + return -EINVAL;
> +
> + buf = strstrip(buf);
> +
> + while ((start = strsep(&buf, " ")) != NULL) {
> + if (!strlen(start))
> + continue;
> + switch (match_token(start, tokens, args)) {
> + case MEMORY_RECLAIM_SWAPPINESS:
> + if (match_int(&args[0], &swappiness))
> + return -EINVAL;
> + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
>
> reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> - GFP_KERNEL, reclaim_options);
> + GFP_KERNEL, reclaim_options,
> + swappiness == -1 ? NULL : &swappiness);
>
> if (!reclaimed && !nr_retries--)
> return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..aa5666842c49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,9 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> + int *swappiness;
> +
> /* Can active folios be deactivated as part of reclaim? */
> #define DEACTIVATE_ANON 1
> #define DEACTIVATE_FILE 2
> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> - int swappiness = mem_cgroup_swappiness(memcg);
> + int swappiness = sc->swappiness ?
> + *sc->swappiness : mem_cgroup_swappiness(memcg);
> u64 fraction[ANON_AND_FILE];
> u64 denominator = 0; /* gcc */
> enum scan_balance scan_balance;
> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> return 0;
>
> + if (sc->swappiness)
> + return *sc->swappiness;
> +
> return mem_cgroup_swappiness(memcg);
> }
>
> @@ -6463,12 +6470,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options)
> + unsigned int reclaim_options,
> + int *swappiness)
> {
> unsigned long nr_reclaimed;
> unsigned int noreclaim_flag;
> struct scan_control sc = {
> .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> + .swappiness = swappiness,
> .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> .reclaim_idx = MAX_NR_ZONES - 1,
> --
> 2.39.3
>

The swappiness = -1 part seems a bit awkward, but otherwise LGTM.
Reviewed-by: Nhat Pham <[email protected]>
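
Editor's note: one way to drop the -1 sentinel Nhat flags, sketched
hypothetically against the quoted memory_reclaim() (this is not part of the
posted patch), is a second pointer variable that becomes non-NULL only once
swappiness= has actually been parsed:

	int swappiness;
	int *swappiness_p = NULL;
	...
	case MEMORY_RECLAIM_SWAPPINESS:
		if (match_int(&args[0], &swappiness))
			return -EINVAL;
		if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
			return -EINVAL;
		swappiness_p = &swappiness;	/* set only on a successful parse */
		break;
	...
	reclaimed = try_to_free_mem_cgroup_pages(memcg,
			min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
			GFP_KERNEL, reclaim_options, swappiness_p);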

2024-01-03 00:28:08

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Tue, Jan 2, 2024 at 8:21 AM Dan Schatzberg <[email protected]> wrote:
>
> Hi Yu Zhao,
>
> Thanks for the feedback, sorry for the delayed response.
>
> On Thu, Dec 21, 2023 at 10:31:59PM -0700, Yu Zhao wrote:
> > On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <[email protected]> wrote:
> > >
> > > ...
> >
> > The cover letter says:
> > "Previously, this exact interface addition was proposed by Yosry[3]."
> >
> > So I think it should be acknowledged with a Suggested-by, based on:
> > "A Suggested-by: tag indicates that the patch idea is suggested by the
> > person named and ensures credit to the person for the idea."
> > from
> > https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
>
> Sure, will do.
>
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..aa5666842c49 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,9 @@ struct scan_control {
> > > unsigned long anon_cost;
> > > unsigned long file_cost;
> > >
> > > + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> > > + int *swappiness;
> >
> > Using a pointer to indicate whether the type it points to is
> > overridden isn't really a good practice.
> >
> > A better alternative was suggested during the v2:
> > "Perhaps the negative to avoid unnecessary dereferences."
> > https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/
>
> I did have a couple versions with a negative but it creates
> initialization issues where every instantiation of scan_control needs
> to make sure to initialize swappiness or else it will behave as if
> swappiness is 0. That's pretty error-prone, so using the pointer seemed
> the better approach.
>
> > Since only proactive reclaim can override swappiness, meaning it only
> > happens if sc->proactive is true, I think the best way to make it work
> > without spending much effort is to create a helper as Michal suggests,
> > but it should look like:
> >
> > sc_swappiness()
> > {
> > return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg);
> > }
> >
> > In this patchset, sc->swappiness really means
> > sc->proactive_swappiness. So it should be renamed accordingly.
>
> Helper aside, I disagree with this point about coupling with the
> proactive flag.

Sure. But I would like to hear a *concrete* counterexample.

> The fact that the only user currently is proactive
> reclaim

Yes, that's a fact, and we should make the decision based on the
currently known facts.

> doesn't imply to me that the interface (in scan_control)
> should be coupled to the use-case.

The future always has uncertainty, which I would not worry so much about.

> It's easier to reason about a
> swappiness field that overrides swappiness for all scans that set it
> regardless of the users.

For example? And how likely would that happen in the next few years?

2024-01-03 08:59:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Tue 02-01-24 12:43:49, Dan Schatzberg wrote:
> On Thu, Dec 21, 2023 at 10:29:59AM +0100, Michal Hocko wrote:
[...]
> Thanks for the review Michal, and sorry for the delayed response. Your
> patch looks reasonable to me, but I'm a bit unclear about the need for
> the #ifdef - mem_cgroup_swappiness() already works correctly regardless
> of CONFIG_MEMCG - so why not make sc->swappiness and sc_swappiness()
> unconditional?

We do not have any user other than memcg proactive reclaim. Making it
conditional makes that more explicit. Not something I would insist on,
of course.
--
Michal Hocko
SUSE Labs

2024-01-03 09:03:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Tue 02-01-24 10:21:44, Dan Schatzberg wrote:
> Hi Yu Zhao,
>
> Thanks for the feedback, sorry for the delayed response.
>
> On Thu, Dec 21, 2023 at 10:31:59PM -0700, Yu Zhao wrote:
> > On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <[email protected]> wrote:
> > >
> > > ...
> >
> > The cover letter says:
> > "Previously, this exact interface addition was proposed by Yosry[3]."
> >
> > So I think it should be acknowledged with a Suggested-by, based on:
> > "A Suggested-by: tag indicates that the patch idea is suggested by the
> > person named and ensures credit to the person for the idea."
> > from
> > https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
>
> Sure, will do.
>
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..aa5666842c49 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,9 @@ struct scan_control {
> > > unsigned long anon_cost;
> > > unsigned long file_cost;
> > >
> > > + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> > > + int *swappiness;
> >
> > Using a pointer to indicate whether the type it points to is
> > overridden isn't really a good practice.
> >
> > A better alternative was suggested during the v2:
> > "Perhaps the negative to avoid unnecessary dereferences."
> > https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/
>
> I did have a couple versions with a negative but it creates
> initialization issues where every instantiation of scan_control needs
> to make sure to initialize swappiness or else it will behave as if
> swappiness is 0. That's pretty error-prone, so using the pointer seemed
> the better approach.

I do agree with this. Especially for an opt-in feature, it is better if
the default initialization has a clear meaning. In this case, even if
somebody doesn't use the helper, the NULL should be caught as a NULL
pointer dereference rather than a silent misbehavior.
--
Michal Hocko
SUSE Labs

2024-01-03 15:20:45

by Dan Schatzberg

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: add swappiness= arg to memory.reclaim

On Tue, Jan 02, 2024 at 05:27:18PM -0700, Yu Zhao wrote:
[...]
> > Helper aside, I disagree with this point about coupling with the
> > proactive flag.
>
> Sure. But I would like to hear a *concrete* counterexample.
>
> > The fact that the only user currently is proactive
> > reclaim
>
> Yes, that's a fact, and we should make the decision based on the
> currently known facts.
>
> > doesn't imply to me that the interface (in scan_control)
> > should be coupled to the use-case.
>
> The future always has uncertainty, which I would not worry so much about.
>
> > It's easier to reason about a
> > swappiness field that overrides swappiness for all scans that set it
> > regardless of the users.
>
> For example? And how likely would that happen in the next few years?

My argument isn't that making the interface more generic will be
worthwhile due to some future use-case. Rather, my argument is that
making the interface more generic makes the code simpler. All else
being equal, having sc->swappiness behave the same regardless of
sc->proactive makes vmscan.c and struct scan_control easier to follow.
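
Editor's note: in code terms, the disagreement is only about whether the
override is gated on sc->proactive. A hedged sketch of the two shapes
discussed in the thread; the field names follow each proposal and the two
variants could not coexist in one scan_control:

/* Decoupled (Michal's helper, Dan's preference): any scan that sets
 * sc->swappiness gets the override, proactive or not. */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	return sc->swappiness ? *sc->swappiness : mem_cgroup_swappiness(memcg);
}

/* Coupled (Yu's suggestion): honored only for proactive reclaim, with
 * the field renamed to proactive_swappiness to match. */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	if (sc->proactive && sc->proactive_swappiness)
		return *sc->proactive_swappiness;
	return mem_cgroup_swappiness(memcg);
}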

That being said, I'm fine with conceding this point, particularly
since both you and Michal appear to feel similarly. I'll make the
corresponding change and send out a new version.