LinuxLists.cc - [PATCH v6 0/2] Add swappiness argument to memory.reclaim

2024-01-03 16:49:25

Subject: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

Changes since V5:
* Made the scan_control behavior limited to proactive reclaim explicitly
* created sc_swappiness helper to reduce chance of mis-use

Changes since V4:
* Fixed some initialization bugs by reverting back to a pointer for swappiness
* Added some more caveats to the behavior of swappiness in documentation

Changes since V3:
* Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
* Added explicit calls to mem_cgroup_swappiness

Changes since V2:
* No functional change
* Used int consistently rather than a pointer

Changes since V1:
* Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not allow for any way to effect the
balance of file vs anon during proactive reclaim. The only approach is to adjust
the vm.swappiness setting. However, there are a few reasons we look to control
the balance of file vs anon during proactive reclaim, separately from reactive
reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write endurance.
However, proactive reclaim runs continuously and so its impact on SSD write
endurance is more significant. Therefore it is desireable to control swap-out
for proactive reclaim separately from reactive reclaim

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
exhaustion. This makes sense if the swap exhaustion is triggered due to
reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
one could see OOMs when free memory is ample but anon is just particularly
cold). Therefore, it's desireable to have proactive reclaim reduce or stop
swap-out before the threshold at which OOM killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
writes to memory.reclaim[2]. This has been in production for nearly two years
and has addressed our needs to control proactive vs reactive reclaim behavior
but is still not ideal for a number of reasons:

* vm.swappiness is a global setting, adjusting it can race/interfere with other
system administration that wishes to control vm.swappiness. In our case, we
need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
misconfigured setting. This requires some additional management to record the
"desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman proposed instead an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently Huan also proposed this as well[5] and others
similarly questioned if this was the proper interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this prupose. Proactive reclaimers are un-aware of the relative age of file
vs anon for a cgroup which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need some amount of anon
reclaim attempts separate from the amount of file reclaim attempts which seems
brittle given that it's difficult to observe the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/[email protected]/

Dan Schatzberg (2):
mm: add defines for min/max swappiness
mm: add swapiness= arg to memory.reclaim

Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
include/linux/swap.h | 5 ++-
mm/memcontrol.c | 58 ++++++++++++++++++++-----
mm/vmscan.c | 39 ++++++++++++-----
4 files changed, 90 insertions(+), 30 deletions(-)

--
2.39.3

2024-01-03 16:50:22

by Dan Schatzberg

[permalink] [raw]

Subject: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not allow for any way
to effect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons we look to control the balance of file vs anon during
proactive reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously and so its
impact on SSD write endurance is more significant. Therefore it is
desireable to control swap-out for proactive reclaim separately from
reactive reclaim

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
swap exhaustion. This makes sense if the swap exhaustion is triggered
due to reactive reclaim but less so if it is triggered due to proactive
reclaim (e.g. one could see OOMs when free memory is ample but anon is
just particularly cold). Therefore, it's desireable to have proactive
reclaim reduce or stop swap-out before the threshold at which OOM
killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting, adjusting it can race/interfere
with other system administration that wishes to control vm.swappiness.
In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
a misconfigured setting. This requires some additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Signed-off-by: Dan Schatzberg <[email protected]>
Suggested-by: Yosry Ahmed <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Chris Li <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
include/linux/swap.h | 3 +-
mm/memcontrol.c | 56 ++++++++++++++++++++-----
mm/vmscan.c | 25 +++++++++--
4 files changed, 80 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..ee42f74e0765 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.

- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
-
Example::

echo "1G" > memory.reclaim

- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.

+The following nested keys are defined.
+
+ ========== ================================
+ swappiness Swappiness value to reclaim with
+ ========== ================================
+
+ Specifying a swappiness value instructs the kernel to perform
+ the reclaim with that swappiness value. Note that this has the
+ same semantics as vm.swappiness applied to memcg reclaim with
+ all the existing limitations and potential future extensions.
+
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..8afdec40efe3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options);
+ unsigned int reclaim_options,
+ int *swappiness);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fbe9f02dd206..6d627a754851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -52,6 +52,7 @@
#include <linux/sort.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
+#include <linux/parser.h>
#include <linux/vmpressure.h>
#include <linux/memremap.h>
#include <linux/mm_inline.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
- MEMCG_RECLAIM_MAY_SWAP);
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options);
+ gfp_mask, reclaim_options, NULL);
psi_memstall_leave(&pflags);

if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
ret = -EBUSY;
break;
}
@@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;

if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP))
+ MEMCG_RECLAIM_MAY_SWAP, NULL))
nr_retries--;
}

@@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}

reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);

if (!reclaimed && !nr_retries--)
break;
@@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,

if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
nr_reclaims--;
continue;
}
@@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+enum {
+ MEMORY_RECLAIM_SWAPPINESS = 0,
+ MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+ { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+ { MEMORY_RECLAIM_NULL, NULL },
+};
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
+ int swappiness = -1;
unsigned int reclaim_options;
- int err;
+ char *old_buf, *start;
+ substring_t args[MAX_OPT_ARGS];

buf = strstrip(buf);
- err = page_counter_memparse(buf, "", &nr_to_reclaim);
- if (err)
- return err;
+
+ old_buf = buf;
+ nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+ if (buf == old_buf)
+ return -EINVAL;
+
+ buf = strstrip(buf);
+
+ while ((start = strsep(&buf, " ")) != NULL) {
+ if (!strlen(start))
+ continue;
+ switch (match_token(start, tokens, args)) {
+ case MEMORY_RECLAIM_SWAPPINESS:
+ if (match_int(&args[0], &swappiness))
+ return -EINVAL;
+ if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }

reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

reclaimed = try_to_free_mem_cgroup_pages(memcg,
min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
- GFP_KERNEL, reclaim_options);
+ GFP_KERNEL, reclaim_options,
+ swappiness == -1 ? NULL : &swappiness);

if (!reclaimed && !nr_retries--)
return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d91963e2d47f..394e0dd46b2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,6 +92,11 @@ struct scan_control {
unsigned long anon_cost;
unsigned long file_cost;

+#ifdef CONFIG_MEMCG
+ /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
+ int *proactive_swappiness;
+#endif
+
/* Can active folios be deactivated as part of reclaim? */
#define DEACTIVATE_ANON 1
#define DEACTIVATE_FILE 2
@@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
#endif
return false;
}
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+ if (sc->proactive && sc->proactive_swappiness)
+ return *sc->proactive_swappiness;
+ return mem_cgroup_swappiness(memcg);
+}
#else
static bool cgroup_reclaim(struct scan_control *sc)
{
@@ -242,6 +254,11 @@ static bool writeback_throttling_sane(struct scan_control *sc)
{
return true;
}
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+ return READ_ONCE(vm_swappiness);
+}
#endif

static void set_task_reclaim_state(struct task_struct *task,
@@ -2327,7 +2344,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
- int swappiness = mem_cgroup_swappiness(memcg);
+ int swappiness = sc_swappiness(sc, memcg);
u64 fraction[ANON_AND_FILE];
u64 denominator = 0; /* gcc */
enum scan_balance scan_balance;
@@ -2608,7 +2625,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
return 0;

- return mem_cgroup_swappiness(memcg);
+ return sc_swappiness(sc, memcg);
}

static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6463,12 +6480,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options)
+ unsigned int reclaim_options,
+ int *swappiness)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+ .proactive_swappiness = swappiness,
.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.reclaim_idx = MAX_NR_ZONES - 1,
--
2.39.3

2024-01-03 17:20:32

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed, Jan 3, 2024 at 9:49 AM Dan Schatzberg <[email protected]> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to effect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons we look to control the balance of file vs anon during
> proactive reclaim, separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desireable to control swap-out for proactive reclaim separately from
> reactive reclaim
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desireable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting, adjusting it can race/interfere
> with other system administration that wishes to control vm.swappiness.
> In our case, we need to disable Senpai before adjusting vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> Suggested-by: Yosry Ahmed <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Chris Li <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 56 ++++++++++++++++++++-----
> mm/vmscan.c | 25 +++++++++--
> 4 files changed, 80 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..ee42f74e0765 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
> This is a simple interface to trigger memory reclaim in the
> target cgroup.
>
> - This file accepts a single key, the number of bytes to reclaim.
> - No nested keys are currently supported.
> -
> Example::
>
> echo "1G" > memory.reclaim
>
> - The interface can be later extended with nested keys to
> - configure the reclaim behavior. For example, specify the
> - type of memory to reclaim from (anon, file, ..).
> -
> Please note that the kernel can over or under reclaim from
> the target cgroup. If less bytes are reclaimed than the
> specified amount, -EAGAIN is returned.
> @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> + ========== ================================
> + swappiness Swappiness value to reclaim with
> + ========== ================================
> +
> + Specifying a swappiness value instructs the kernel to perform
> + the reclaim with that swappiness value. Note that this has the
> + same semantics as vm.swappiness applied to memcg reclaim with
> + all the existing limitations and potential future extensions.
> +
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e2ab76c25b4a..8afdec40efe3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options);
> + unsigned int reclaim_options,
> + int *swappiness);
> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fbe9f02dd206..6d627a754851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -52,6 +52,7 @@
> #include <linux/sort.h>
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> +#include <linux/parser.h>
> #include <linux/vmpressure.h>
> #include <linux/memremap.h>
> #include <linux/mm_inline.h>
> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> psi_memstall_enter(&pflags);
> nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> gfp_mask,
> - MEMCG_RECLAIM_MAY_SWAP);
> + MEMCG_RECLAIM_MAY_SWAP,
> + NULL);
> psi_memstall_leave(&pflags);
> } while ((memcg = parent_mem_cgroup(memcg)) &&
> !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> psi_memstall_enter(&pflags);
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> - gfp_mask, reclaim_options);
> + gfp_mask, reclaim_options, NULL);
> psi_memstall_leave(&pflags);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> }
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> ret = -EBUSY;
> break;
> }
> @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return -EINTR;
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - MEMCG_RECLAIM_MAY_SWAP))
> + MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_retries--;
> }
>
> @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> }
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>
> if (!reclaimed && !nr_retries--)
> break;
> @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
> if (nr_reclaims) {
> if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
> nr_reclaims--;
> continue;
> }
> @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +enum {
> + MEMORY_RECLAIM_SWAPPINESS = 0,
> + MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> + { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + int swappiness = -1;
> unsigned int reclaim_options;
> - int err;
> + char *old_buf, *start;
> + substring_t args[MAX_OPT_ARGS];
>
> buf = strstrip(buf);
> - err = page_counter_memparse(buf, "", &nr_to_reclaim);
> - if (err)
> - return err;
> +
> + old_buf = buf;
> + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> + if (buf == old_buf)
> + return -EINVAL;
> +
> + buf = strstrip(buf);
> +
> + while ((start = strsep(&buf, " ")) != NULL) {
> + if (!strlen(start))
> + continue;
> + switch (match_token(start, tokens, args)) {
> + case MEMORY_RECLAIM_SWAPPINESS:
> + if (match_int(&args[0], &swappiness))
> + return -EINVAL;
> + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
>
> reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> - GFP_KERNEL, reclaim_options);
> + GFP_KERNEL, reclaim_options,
> + swappiness == -1 ? NULL : &swappiness);
>
> if (!reclaimed && !nr_retries--)
> return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> +#ifdef CONFIG_MEMCG
> + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> + int *proactive_swappiness;
> +#endif

Why is proactive_swappiness still a pointer? The whole point of the
previous conversation is that sc->proactive can tell whether
sc->swappiness is valid or not, and that's less awkward than using a
pointer.

Also why the #ifdef here? I don't see the point for a small stack
variable. Otherwise wouldn't we want to do this for sc->proactive as
well?

If you really want it to be explicit, you could do
struct scan_control {
...
struct {
bool is_set;
int swappiness;
} proactive;
};

But I think even this is too much.

2024-01-03 18:20:43

by Dan Schatzberg

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> > unsigned long anon_cost;
> > unsigned long file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > + int *proactive_swappiness;
> > +#endif
>
> Why is proactive_swappiness still a pointer? The whole point of the
> previous conversation is that sc->proactive can tell whether
> sc->swappiness is valid or not, and that's less awkward than using a
> pointer.

It's the same reason as before - zero initialization ensures that the
pointer is NULL which tells us if it's valid or not. Proactive reclaim
might not set swappiness and you need to distinguish swappiness of 0
and not-set. See this discussion with Michal:

https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

> Also why the #ifdef here? I don't see the point for a small stack
> variable. Otherwise wouldn't we want to do this for sc->proactive as
> well?

This was Michal's request and it feels similar to your rationale for
naming it proactive_swappiness - it's just restricting the interface
down to the only use-cases. I'd be fine with doing the same in
sc->proactive as a subsequent patch.

See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/

2024-01-04 01:08:00

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > > unsigned long anon_cost;
> > > unsigned long file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > + int *proactive_swappiness;
> > > +#endif
> >
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
>
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
>
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
+ int swappiness = -1;
...
reclaimed = try_to_free_mem_cgroup_pages(memcg,
min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
- GFP_KERNEL, reclaim_options);
+ GFP_KERNEL, reclaim_options,
+ swappiness);

...

+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+ return sc->proactive && sc->proactive_swappiness > -1 ?
+ sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
+}

2024-01-04 01:18:34

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed, Jan 3, 2024 at 11:20 AM Dan Schatzberg <[email protected]> wrote:
>
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > > unsigned long anon_cost;
> > > unsigned long file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > + int *proactive_swappiness;
> > > +#endif
> >
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
>
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
>
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
>
> > Also why the #ifdef here? I don't see the point for a small stack
> > variable. Otherwise wouldn't we want to do this for sc->proactive as
> > well?
>
> This was Michal's request and it feels similar to your rationale for
> naming it proactive_swappiness - it's just restricting the interface
> down to the only use-cases. I'd be fine with doing the same in
> sc->proactive as a subsequent patch.
>
> See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/

Also regarding #ifdef, quoting Documentation/process/4.Coding.rst:
"As a general rule, #ifdef use should be confined to header files
whenever possible."

2024-01-04 08:49:04

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > unsigned long anon_cost;
> > > > unsigned long file_cost;
> > > >
> > > > +#ifdef CONFIG_MEMCG
> > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > + int *proactive_swappiness;
> > > > +#endif
> > >
> > > Why is proactive_swappiness still a pointer? The whole point of the
> > > previous conversation is that sc->proactive can tell whether
> > > sc->swappiness is valid or not, and that's less awkward than using a
> > > pointer.
> >
> > It's the same reason as before - zero initialization ensures that the
> > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > might not set swappiness and you need to distinguish swappiness of 0
> > and not-set. See this discussion with Michal:
> >
> > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
>
> static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + int swappiness = -1;
> ...
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> - GFP_KERNEL, reclaim_options);
> + GFP_KERNEL, reclaim_options,
> + swappiness);
>
> ...
>
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> + return sc->proactive && sc->proactive_swappiness > -1 ?
> + sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> +}

Tpo be completely honest I really fail to see why this is such a hot
discussion point. To be completely clear both approaches are feasible.

The main argument for NULL check based approach is that it is less error
prone from an incorrect ussage because any bug becomes obvious. If we
use any other special constant a missing initialization would be much
harder to spot because they would be subtle behavior change.

Are there really any strong arguments to go against this "default
initialization is safe" policy?
--
Michal Hocko
SUSE Labs

2024-01-04 10:09:35

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> +#ifdef CONFIG_MEMCG
> + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> + int *proactive_swappiness;
> +#endif
> +
> /* Can active folios be deactivated as part of reclaim? */
> #define DEACTIVATE_ANON 1
> #define DEACTIVATE_FILE 2
> @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> #endif
> return false;
> }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> + if (sc->proactive && sc->proactive_swappiness)
> + return *sc->proactive_swappiness;
> + return mem_cgroup_swappiness(memcg);
> +}

If you really want to make this sc->proactive bound then do not use
CONFIG_MEMCG as sc->proactive is not guarded either.

I do not think that sc->proactive check is really necessary. A pure NULL
check is sufficient to have a valid and self evident code that is future
proof. But TBH this is not the most important aspect of the patch to
spend much more time discussing. Either go with sc->proactive but make
it config space consistent or simply rely on NULL check (with or without
MEMCG guard as both are valid options).

--
Michal Hocko
SUSE Labs

2024-01-09 23:56:05

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > [...]
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > > unsigned long anon_cost;
> > > > > unsigned long file_cost;
> > > > >
> > > > > +#ifdef CONFIG_MEMCG
> > > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > + int *proactive_swappiness;
> > > > > +#endif
> > > >
> > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > previous conversation is that sc->proactive can tell whether
> > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > pointer.
> > >
> > > It's the same reason as before - zero initialization ensures that the
> > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > might not set swappiness and you need to distinguish swappiness of 0
> > > and not-set. See this discussion with Michal:
> > >
> > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> >
> > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > size_t nbytes, loff_t off)
> > {
> > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > + int swappiness = -1;
> > ...
> > reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > - GFP_KERNEL, reclaim_options);
> > + GFP_KERNEL, reclaim_options,
> > + swappiness);
> >
> > ...
> >
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > + return sc->proactive && sc->proactive_swappiness > -1 ?
> > + sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > +}
>
> Tpo be completely honest I really fail to see why this is such a hot
> discussion point. To be completely clear both approaches are feasible.

Feasible but not equal.

> The main argument for NULL check based approach is that it is less error
> prone from an incorrect ussage because any bug becomes obvious.

Any bug becomes *fatal*, and fatal isn't only obvious but also hurts
in production systems.

This was the reason for going through the trouble switching from
VM_BUG_ON() to VM_WARN_ON() and documenting it in
Documentation/process/coding-style.rst:

22) Do not crash the kernel
---------------------------

In general, the decision to crash the kernel belongs to the user, rather
than to the kernel developer.

Isn't?

> If we
> use any other special constant a missing initialization would be much
> harder to spot because they would be subtle behavior change.
>
> Are there really any strong arguments to go against this "default
> initialization is safe" policy?

Just wanted to point out an alternative. Fine details (best practices)
matter to me.

2024-01-09 23:58:38

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Thu, Jan 4, 2024 at 3:09 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> > unsigned long anon_cost;
> > unsigned long file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > + int *proactive_swappiness;
> > +#endif
> > +
> > /* Can active folios be deactivated as part of reclaim? */
> > #define DEACTIVATE_ANON 1
> > #define DEACTIVATE_FILE 2
> > @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> > #endif
> > return false;
> > }
> > +
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > + if (sc->proactive && sc->proactive_swappiness)
> > + return *sc->proactive_swappiness;
> > + return mem_cgroup_swappiness(memcg);
> > +}
>
> If you really want to make this sc->proactive bound then do not use
> CONFIG_MEMCG as sc->proactive is not guarded either.
>
> I do not think that sc->proactive check is really necessary. A pure NULL
> check is sufficient to have a valid and self evident code that is future
> proof. But TBH this is not the most important aspect of the patch to
> spend much more time discussing. Either go with sc->proactive but make
> it config space consistent or simply rely on NULL check (with or without
> MEMCG guard as both are valid options).

Now you see why I replied. That "hybrid" if statement is just neither
of what was suggested.

2024-01-10 10:33:12

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim

On Tue 09-01-24 16:54:15, Yu Zhao wrote:
> On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > > [...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > > > unsigned long anon_cost;
> > > > > > unsigned long file_cost;
> > > > > >
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > > + int *proactive_swappiness;
> > > > > > +#endif
> > > > >
> > > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > > previous conversation is that sc->proactive can tell whether
> > > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > > pointer.
> > > >
> > > > It's the same reason as before - zero initialization ensures that the
> > > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > > might not set swappiness and you need to distinguish swappiness of 0
> > > > and not-set. See this discussion with Michal:
> > > >
> > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> > >
> > > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > size_t nbytes, loff_t off)
> > > {
> > > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > + int swappiness = -1;
> > > ...
> > > reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > - GFP_KERNEL, reclaim_options);
> > > + GFP_KERNEL, reclaim_options,
> > > + swappiness);
> > >
> > > ...
> > >
> > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > > +{
> > > + return sc->proactive && sc->proactive_swappiness > -1 ?
> > > + sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > > +}
> >
> > Tpo be completely honest I really fail to see why this is such a hot
> > discussion point. To be completely clear both approaches are feasible.
>
> Feasible but not equal.
>
> > The main argument for NULL check based approach is that it is less error
> > prone from an incorrect ussage because any bug becomes obvious.
>
> Any bug becomes *fatal*, and fatal isn't only obvious but also hurts
> in production systems.
>
> This was the reason for going through the trouble switching from
> VM_BUG_ON() to VM_WARN_ON() and documenting it in
> Documentation/process/coding-style.rst:
>
> 22) Do not crash the kernel
> ---------------------------
>
> In general, the decision to crash the kernel belongs to the user, rather
> than to the kernel developer.
>
> Isn't?

I do agree with this general statement but I do not think it is
applicable in this context.

This is not an explicit BUG() when kernel explicitly sets to panic the
system. We are talking about subtle misbehavior which might be
non-trivial to debug (there are other reasons to not swap at all) vs. a
potential NULL ptr which will kill the userspace in a very obvious way.
Sure there are risks with that but checks for potential NULL ptr
dereferncing is easier than forgot explicit initialization. There are
clear pros and cons for both approaches. NULL default initialized
structures members which allow for behavior override are a general
kernel pattern so I do not really see this going way off the rails.
--
Michal Hocko
SUSE Labs

2024-06-11 19:25:40

by Shakeel Butt

[permalink] [raw]

Subject: Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

Hi folks,

This series has been in the mm-unstable for several months. Are there
any remaining concerns here otherwise can we please put this in the
mm-stable branch to be merged in the next Linux release?

On Wed, Jan 03, 2024 at 08:48:35AM GMT, Dan Schatzberg wrote:
> Changes since V5:
> * Made the scan_control behavior limited to proactive reclaim explicitly
> * created sc_swappiness helper to reduce chance of mis-use
>
> Changes since V4:
> * Fixed some initialization bugs by reverting back to a pointer for swappiness
> * Added some more caveats to the behavior of swappiness in documentation
>
> Changes since V3:
> * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
> * Added explicit calls to mem_cgroup_swappiness
>
> Changes since V2:
> * No functional change
> * Used int consistently rather than a pointer
>
> Changes since V1:
> * Added documentation
>
> This patch proposes augmenting the memory.reclaim interface with a
> swappiness=<val> argument that overrides the swappiness value for that instance
> of proactive reclaim.
>
> Userspace proactive reclaimers use the memory.reclaim interface to trigger
> reclaim. The memory.reclaim interface does not allow for any way to effect the
> balance of file vs anon during proactive reclaim. The only approach is to adjust
> the vm.swappiness setting. However, there are a few reasons we look to control
> the balance of file vs anon during proactive reclaim, separately from reactive
> reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write endurance.
> However, proactive reclaim runs continuously and so its impact on SSD write
> endurance is more significant. Therefore it is desireable to control swap-out
> for proactive reclaim separately from reactive reclaim
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
> exhaustion. This makes sense if the swap exhaustion is triggered due to
> reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
> one could see OOMs when free memory is ample but anon is just particularly
> cold). Therefore, it's desireable to have proactive reclaim reduce or stop
> swap-out before the threshold at which OOM killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
> writes to memory.reclaim[2]. This has been in production for nearly two years
> and has addressed our needs to control proactive vs reactive reclaim behavior
> but is still not ideal for a number of reasons:
>
> * vm.swappiness is a global setting, adjusting it can race/interfere with other
> system administration that wishes to control vm.swappiness. In our case, we
> need to disable Senpai before adjusting vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave a
> misconfigured setting. This requires some additional management to record the
> "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness globally.
>
> Previously, this exact interface addition was proposed by Yosry[3]. In response,
> Roman proposed instead an interface to specify precise file/anon/slab reclaim
> amounts[4]. More recently Huan also proposed this as well[5] and others
> similarly questioned if this was the proper interface.
>
> Previous proposals sought to use this to allow proactive reclaimers to
> effectively perform a custom reclaim algorithm by issuing proactive reclaim with
> different settings to control file vs anon reclaim (e.g. to only reclaim anon
> from some applications). Responses argued that adjusting swappiness is a poor
> interface for custom reclaim.
>
> In contrast, I argue in favor of a swappiness setting not as a way to implement
> custom reclaim algorithms but rather to bias the balance of anon vs file due to
> differences of proactive vs reactive reclaim. In this context, swappiness is the
> existing interface for controlling this balance and this patch simply allows for
> it to be configured differently for proactive vs reactive reclaim.
>
> Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
> for this prupose. Proactive reclaimers are un-aware of the relative age of file
> vs anon for a cgroup which makes it difficult to manage proactive reclaim of
> different memory pools. A proactive reclaimer would need some amount of anon
> reclaim attempts separate from the amount of file reclaim attempts which seems
> brittle given that it's difficult to observe the impact.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
> [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
> [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
> [5]https://lore.kernel.org/lkml/[email protected]/
>
> Dan Schatzberg (2):
> mm: add defines for min/max swappiness
> mm: add swapiness= arg to memory.reclaim
>
> Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
> include/linux/swap.h | 5 ++-
> mm/memcontrol.c | 58 ++++++++++++++++++++-----
> mm/vmscan.c | 39 ++++++++++++-----
> 4 files changed, 90 insertions(+), 30 deletions(-)
>
> --
> 2.39.3
>

2024-06-11 19:33:19

by Yosry Ahmed

[permalink] [raw]

Subject: Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

On Tue, Jun 11, 2024 at 12:25 PM Shakeel Butt <[email protected]> wrote:
>
> Hi folks,
>
> This series has been in the mm-unstable for several months. Are there
> any remaining concerns here otherwise can we please put this in the
> mm-stable branch to be merged in the next Linux release?

+Yu Zhao

I don't think Yu Zhao was correctly CC'd on this :)

>
> On Wed, Jan 03, 2024 at 08:48:35AM GMT, Dan Schatzberg wrote:
> > Changes since V5:
> > * Made the scan_control behavior limited to proactive reclaim explicitly
> > * created sc_swappiness helper to reduce chance of mis-use
> >
> > Changes since V4:
> > * Fixed some initialization bugs by reverting back to a pointer for swappiness
> > * Added some more caveats to the behavior of swappiness in documentation
> >
> > Changes since V3:
> > * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
> > * Added explicit calls to mem_cgroup_swappiness
> >
> > Changes since V2:
> > * No functional change
> > * Used int consistently rather than a pointer
> >
> > Changes since V1:
> > * Added documentation
> >
> > This patch proposes augmenting the memory.reclaim interface with a
> > swappiness=<val> argument that overrides the swappiness value for that instance
> > of proactive reclaim.
> >
> > Userspace proactive reclaimers use the memory.reclaim interface to trigger
> > reclaim. The memory.reclaim interface does not allow for any way to effect the
> > balance of file vs anon during proactive reclaim. The only approach is to adjust
> > the vm.swappiness setting. However, there are a few reasons we look to control
> > the balance of file vs anon during proactive reclaim, separately from reactive
> > reclaim:
> >
> > * Swapout should be limited to manage SSD write endurance. In near-OOM
> > situations we are fine with lots of swap-out to avoid OOMs. As these are
> > typically rare events, they have relatively little impact on write endurance.
> > However, proactive reclaim runs continuously and so its impact on SSD write
> > endurance is more significant. Therefore it is desireable to control swap-out
> > for proactive reclaim separately from reactive reclaim
> >
> > * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
> > exhaustion. This makes sense if the swap exhaustion is triggered due to
> > reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
> > one could see OOMs when free memory is ample but anon is just particularly
> > cold). Therefore, it's desireable to have proactive reclaim reduce or stop
> > swap-out before the threshold at which OOM killing occurs.
> >
> > In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
> > writes to memory.reclaim[2]. This has been in production for nearly two years
> > and has addressed our needs to control proactive vs reactive reclaim behavior
> > but is still not ideal for a number of reasons:
> >
> > * vm.swappiness is a global setting, adjusting it can race/interfere with other
> > system administration that wishes to control vm.swappiness. In our case, we
> > need to disable Senpai before adjusting vm.swappiness.
> >
> > * vm.swappiness is stateful - so a crash or restart of Senpai can leave a
> > misconfigured setting. This requires some additional management to record the
> > "desired" setting and ensure Senpai always adjusts to it.
> >
> > With this patch, we avoid these downsides of adjusting vm.swappiness globally.
> >
> > Previously, this exact interface addition was proposed by Yosry[3]. In response,
> > Roman proposed instead an interface to specify precise file/anon/slab reclaim
> > amounts[4]. More recently Huan also proposed this as well[5] and others
> > similarly questioned if this was the proper interface.
> >
> > Previous proposals sought to use this to allow proactive reclaimers to
> > effectively perform a custom reclaim algorithm by issuing proactive reclaim with
> > different settings to control file vs anon reclaim (e.g. to only reclaim anon
> > from some applications). Responses argued that adjusting swappiness is a poor
> > interface for custom reclaim.
> >
> > In contrast, I argue in favor of a swappiness setting not as a way to implement
> > custom reclaim algorithms but rather to bias the balance of anon vs file due to
> > differences of proactive vs reactive reclaim. In this context, swappiness is the
> > existing interface for controlling this balance and this patch simply allows for
> > it to be configured differently for proactive vs reactive reclaim.
> >
> > Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
> > for this prupose. Proactive reclaimers are un-aware of the relative age of file
> > vs anon for a cgroup which makes it difficult to manage proactive reclaim of
> > different memory pools. A proactive reclaimer would need some amount of anon
> > reclaim attempts separate from the amount of file reclaim attempts which seems
> > brittle given that it's difficult to observe the impact.
> >
> > [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> > [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
> > [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
> > [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
> > [5]https://lore.kernel.org/lkml/[email protected]/
> >
> > Dan Schatzberg (2):
> > mm: add defines for min/max swappiness
> > mm: add swapiness= arg to memory.reclaim
> >
> > Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
> > include/linux/swap.h | 5 ++-
> > mm/memcontrol.c | 58 ++++++++++++++++++++-----
> > mm/vmscan.c | 39 ++++++++++++-----
> > 4 files changed, 90 insertions(+), 30 deletions(-)
> >
> > --
> > 2.39.3
> >

2024-06-11 19:50:12

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <[email protected]> wrote:

> Hi folks,
>
> This series has been in the mm-unstable for several months. Are there
> any remaining concerns here otherwise can we please put this in the
> mm-stable branch to be merged in the next Linux release?

The review didn't go terribly well so I parked the series awaiting more
clarity. Although on rereading, it seems that Yu Zhao isn't seeing any
blocking issues?

2024-06-11 22:51:07

by Shakeel Butt

[permalink] [raw]

Subject: Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <[email protected]> wrote:
>
> > Hi folks,
> >
> > This series has been in the mm-unstable for several months. Are there
> > any remaining concerns here otherwise can we please put this in the
> > mm-stable branch to be merged in the next Linux release?
>
> The review didn't go terribly well so I parked the series awaiting more
> clarity. Although on rereading, it seems that Yu Zhao isn't seeing any
> blocking issues?
>

Yu, please share if you have any strong concern in merging this series?

2024-06-11 23:11:02

by Yu Zhao

[permalink] [raw]

Subject: Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim

On Tue, Jun 11, 2024 at 4:50 PM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> > On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <[email protected]> wrote:
> >
> > > Hi folks,
> > >
> > > This series has been in the mm-unstable for several months. Are there
> > > any remaining concerns here otherwise can we please put this in the
> > > mm-stable branch to be merged in the next Linux release?
> >
> > The review didn't go terribly well so I parked the series awaiting more
> > clarity. Although on rereading, it seems that Yu Zhao isn't seeing any
> > blocking issues?
> >
>
> Yu, please share if you have any strong concern in merging this series?

I don't remember I had any strong concerns. In fact, I don't remember
what I commented on.

Let me go back to the previous discussion and see why it was stalled.
Will get back to you soon.