Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and
allows protecting working sets of important workloads from
sudden reclaim.
But its semantics has a significant limitation: it works
only as long as there is a supply of reclaimable memory.
This makes it pretty useless against any sort of slow memory
leaks or memory usage increases. This is especially true
for swapless systems. If swap is enabled, memory soft protection
effectively postpones problems, allowing a leaking application
to fill the entire swap area, which makes no sense.
The only effective way to guarantee the memory protection
in this case is to invoke the OOM killer.
It's possible to handle this case in userspace by reacting
to MEMCG_LOW events; but there is still a place for a fail-safe
in-kernel mechanism to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2
memory controller. It works very similarly to memory.low
(sharing the same hierarchical behavior), except that it's
not disabled if there is no more reclaimable memory in the system.
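For example, a workload manager could set up the protection from
userspace roughly as in the sketch below (purely illustrative: the
cgroup path and the value are made up; memory.min accepts a byte
value, with K/M/G suffixes, or "max"):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a value into <cgroup>/memory.min; returns 0 on success. */
static int set_memory_min(const char *cgroup, const char *value)
{
	char path[4096];
	int fd, ret = -1;

	snprintf(path, sizeof(path), "%s/memory.min", cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, value, strlen(value)) == (ssize_t)strlen(value))
		ret = 0;
	close(fd);
	return ret;
}

int main(void)
{
	/* Hypothetical cgroup; protect 512M of its working set. */
	if (set_memory_min("/sys/fs/cgroup/workload", "512M"))
		perror("set_memory_min");
	return 0;
}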
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tejun Heo <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Documentation/cgroup-v2.txt | 24 ++++++++-
include/linux/memcontrol.h | 15 ++++--
include/linux/page_counter.h | 11 +++-
mm/memcontrol.c | 118 ++++++++++++++++++++++++++++++++++---------
mm/page_counter.c | 63 ++++++++++++++++-------
mm/vmscan.c | 18 ++++++-
6 files changed, 199 insertions(+), 50 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 657fe1769c75..a413118b9c29 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1002,6 +1002,26 @@ PAGE_SIZE multiple when read back.
The total amount of memory currently being used by the cgroup
and its descendants.
+ memory.min
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ Hard memory protection. If the memory usage of a cgroup
+ is within its effective min boundary, the cgroup's memory
+ won't be reclaimed under any conditions. If there is no
+ unprotected reclaimable memory available, OOM killer
+ is invoked.
+
+ Effective low boundary is limited by memory.min values of
+ all ancestor cgroups. If there is memory.min overcommitment
+ (child cgroup or cgroups are requiring more protected memory
+ than parent will allow), then each child cgroup will get
+ the part of parent's protection proportional to its
+ actual memory usage below memory.min.
+
+ Putting more memory than generally available under this
+ protection is discouraged and may lead to constant OOMs.
+
memory.low
A read-write single value file which exists on non-root
cgroups. The default is "0".
@@ -1013,9 +1033,9 @@ PAGE_SIZE multiple when read back.
Effective low boundary is limited by memory.low values of
all ancestor cgroups. If there is memory.low overcommitment
- (child cgroup or cgroups are requiring more protected memory,
+ (child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
- the part of parent's protection proportional to the its
+ the part of parent's protection proportional to its
actual memory usage below memory.low.
Putting more memory than generally available under this
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a2dfb1872dca..3b65d092614f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -59,6 +59,12 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};
+enum mem_cgroup_protection {
+ MEMCG_PROT_NONE,
+ MEMCG_PROT_LOW,
+ MEMCG_PROT_MIN,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -297,7 +303,8 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -756,10 +763,10 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
{
}
-static inline bool mem_cgroup_low(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+static inline enum mem_cgroup_protection mem_cgroup_protected(
+ struct mem_cgroup *root, struct mem_cgroup *memcg)
{
- return false;
+ return MEMCG_PROT_NONE;
}
static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 7902a727d3b6..bab7e57f659b 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -8,10 +8,16 @@
struct page_counter {
atomic_long_t usage;
- unsigned long max;
+ unsigned long min;
unsigned long low;
+ unsigned long max;
struct page_counter *parent;
+ /* effective memory.min and memory.min usage tracking */
+ unsigned long emin;
+ atomic_long_t min_usage;
+ atomic_long_t children_min_usage;
+
/* effective memory.low and memory.low usage tracking */
unsigned long elow;
atomic_long_t low_usage;
@@ -47,8 +53,9 @@ bool page_counter_try_charge(struct page_counter *counter,
unsigned long nr_pages,
struct page_counter **fail);
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
int page_counter_memparse(const char *buf, const char *max,
unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db89468c231c..d298b06a7fad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4503,6 +4503,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg_offline_kmem(memcg);
@@ -4557,6 +4558,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_max(&memcg->memsw, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
@@ -5294,6 +5296,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
}
+static int memory_min_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long min = READ_ONCE(memcg->memory.min);
+
+ if (min == PAGE_COUNTER_MAX)
+ seq_puts(m, "max\n");
+ else
+ seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+ return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long min;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &min);
+ if (err)
+ return err;
+
+ page_counter_set_min(&memcg->memory, min);
+
+ return nbytes;
+}
+
static int memory_low_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5561,6 +5593,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+ {
+ .name = "min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_min_show,
+ .write = memory_min_write,
+ },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5616,19 +5654,24 @@ struct cgroup_subsys memory_cgrp_subsys = {
};
/**
- * mem_cgroup_low - check if memory consumption is in the normal range
+ * mem_cgroup_protected - check if memory consumption is in the normal range
* @root: the top ancestor of the sub-tree being checked
* @memcg: the memory cgroup to check
*
* WARNING: This function is not stateless! It can only be used as part
* of a top-down tree iteration, not for isolated queries.
*
- * Returns %true if memory consumption of @memcg is in the normal range.
+ * Returns one of the following:
+ * MEMCG_PROT_NONE: cgroup memory is not protected
+ * MEMCG_PROT_LOW: cgroup memory is protected as long there is
+ * an unprotected supply of reclaimable memory from other cgroups.
+ * MEMCG_PROT_MIN: cgroup memory is protected
*
- * @root is exclusive; it is never low when looked at directly
+ * @root is exclusive; it is never protected when looked at directly
*
- * To provide a proper hierarchical behavior, effective memory.low value
- * is used.
+ * To provide a proper hierarchical behavior, effective memory.min/low values
+ * are used. Below is the description of how effective memory.low is calculated.
+ * Effective memory.min values is calculated in the same way.
*
* Effective memory.low is always equal or less than the original memory.low.
* If there is no memory.low overcommittment (which is always true for
@@ -5673,51 +5716,78 @@ struct cgroup_subsys memory_cgrp_subsys = {
* E/memory.current = 0
*
* These calculations require constant tracking of the actual low usages
- * (see propagate_low_usage()), as well as recursive calculation of
- * effective memory.low values. But as we do call mem_cgroup_low()
+ * (see propagate_protected_usage()), as well as recursive calculation of
+ * effective memory.low values. But as we do call mem_cgroup_protected()
* path for each memory cgroup top-down from the reclaim,
* it's possible to optimize this part, and save calculated elow
* for next usage. This part is intentionally racy, but it's ok,
* as memory.low is a best-effort mechanism.
*/
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
- unsigned long usage, low_usage, siblings_low_usage;
- unsigned long elow, parent_elow;
struct mem_cgroup *parent;
+ unsigned long emin, parent_emin;
+ unsigned long elow, parent_elow;
+ unsigned long usage;
if (mem_cgroup_disabled())
- return false;
+ return MEMCG_PROT_NONE;
if (!root)
root = root_mem_cgroup;
if (memcg == root)
- return false;
+ return MEMCG_PROT_NONE;
- elow = memcg->memory.low;
usage = page_counter_read(&memcg->memory);
- parent = parent_mem_cgroup(memcg);
+ if (!usage)
+ return MEMCG_PROT_NONE;
+
+ emin = memcg->memory.min;
+ elow = memcg->memory.low;
+ parent = parent_mem_cgroup(memcg);
if (parent == root)
goto exit;
+ parent_emin = READ_ONCE(parent->memory.emin);
+ emin = min(emin, parent_emin);
+ if (emin && parent_emin) {
+ unsigned long min_usage, siblings_min_usage;
+
+ min_usage = min(usage, memcg->memory.min);
+ siblings_min_usage = atomic_long_read(
+ &parent->memory.children_min_usage);
+
+ if (min_usage && siblings_min_usage)
+ emin = min(emin, parent_emin * min_usage /
+ siblings_min_usage);
+ }
+
parent_elow = READ_ONCE(parent->memory.elow);
elow = min(elow, parent_elow);
+ if (elow && parent_elow) {
+ unsigned long low_usage, siblings_low_usage;
- if (!elow || !parent_elow)
- goto exit;
+ low_usage = min(usage, memcg->memory.low);
+ siblings_low_usage = atomic_long_read(
+ &parent->memory.children_low_usage);
- low_usage = min(usage, memcg->memory.low);
- siblings_low_usage = atomic_long_read(
- &parent->memory.children_low_usage);
-
- if (!low_usage || !siblings_low_usage)
- goto exit;
+ if (low_usage && siblings_low_usage)
+ elow = min(elow, parent_elow * low_usage /
+ siblings_low_usage);
+ }
- elow = min(elow, parent_elow * low_usage / siblings_low_usage);
exit:
+ memcg->memory.emin = emin;
memcg->memory.elow = elow;
- return usage && usage <= elow;
+
+ if (usage <= emin)
+ return MEMCG_PROT_MIN;
+ else if (usage <= elow)
+ return MEMCG_PROT_LOW;
+ else
+ return MEMCG_PROT_NONE;
}
/**
diff --git a/mm/page_counter.c b/mm/page_counter.c
index a5ff4cbc355a..de31470655f6 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,26 +13,38 @@
#include <linux/bug.h>
#include <asm/page.h>
-static void propagate_low_usage(struct page_counter *c, unsigned long usage)
+static void propagate_protected_usage(struct page_counter *c,
+ unsigned long usage)
{
- unsigned long low_usage, old;
+ unsigned long protected, old_protected;
long delta;
if (!c->parent)
return;
- if (!c->low && !atomic_long_read(&c->low_usage))
- return;
+ if (c->min || atomic_long_read(&c->min_usage)) {
+ if (usage <= c->min)
+ protected = usage;
+ else
+ protected = 0;
+
+ old_protected = atomic_long_xchg(&c->min_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_min_usage);
+ }
- if (usage <= c->low)
- low_usage = usage;
- else
- low_usage = 0;
+ if (c->low || atomic_long_read(&c->low_usage)) {
+ if (usage <= c->low)
+ protected = usage;
+ else
+ protected = 0;
- old = atomic_long_xchg(&c->low_usage, low_usage);
- delta = low_usage - old;
- if (delta)
- atomic_long_add(delta, &c->parent->children_low_usage);
+ old_protected = atomic_long_xchg(&c->low_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_low_usage);
+ }
}
/**
@@ -45,7 +57,7 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_sub_return(nr_pages, &counter->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/* More uncharges than charges? */
WARN_ON_ONCE(new < 0);
}
@@ -65,7 +77,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_add_return(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
@@ -109,7 +121,7 @@ bool page_counter_try_charge(struct page_counter *counter,
new = atomic_long_add_return(nr_pages, &c->usage);
if (new > c->max) {
atomic_long_sub(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is racy, but we can live with some
* inaccuracy in the failcnt.
@@ -118,7 +130,7 @@ bool page_counter_try_charge(struct page_counter *counter,
*fail = c;
goto failed;
}
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* Just like with failcnt, we can live with some
* inaccuracy in the watermark.
@@ -190,6 +202,23 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
}
}
+/**
+ * page_counter_set_min - set the amount of protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ counter->min = nr_pages;
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_set_low - set the amount of protected memory
* @counter: counter
@@ -204,7 +233,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
counter->low = nr_pages;
for (c = counter; c; c = c->parent)
- propagate_low_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
}
/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10c8a38c5eef..50055d72f294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2544,12 +2544,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
- if (mem_cgroup_low(root, memcg)) {
+ switch (mem_cgroup_protected(root, memcg)) {
+ case MEMCG_PROT_MIN:
+ /*
+ * Hard protection.
+ * If there is no reclaimable memory, OOM.
+ */
+ continue;
+ case MEMCG_PROT_LOW:
+ /*
+ * Soft protection.
+ * Respect the protection only as long as
+ * there is an unprotected supply
+ * of reclaimable memory from other cgroups.
+ */
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
+ break;
+ case MEMCG_PROT_NONE:
+ break;
}
reclaimed = sc->nr_reclaimed;
--
2.14.3
If a cgroup has no associated tasks, invoking the OOM killer
won't help release any memory, so respecting the memory.min
can lead to an infinite OOM loop or system stall.
Let's ignore memory.min of unpopulated cgroups.
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 10 ++++++++++
mm/vmscan.c | 6 +++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3b65d092614f..7d8472022aae 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -374,6 +374,11 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
css_put(&memcg->css);
}
+static inline bool mem_cgroup_is_populated(struct mem_cgroup *memcg)
+{
+ return cgroup_is_populated(memcg->css.cgroup);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -835,6 +840,11 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
+static inline bool mem_cgroup_is_populated(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 50055d72f294..5e2047e04770 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2549,8 +2549,12 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
+ * Abandoned cgroups are losing protection,
+ * because OOM killer won't release any memory.
*/
- continue;
+ if (mem_cgroup_is_populated(memcg))
+ continue;
+ break;
case MEMCG_PROT_LOW:
/*
* Soft protection.
--
2.14.3
On Thu, May 03, 2018 at 12:43:58PM +0100, Roman Gushchin wrote:
> If a cgroup has no associated tasks, invoking the OOM killer
> won't help release any memory, so respecting the memory.min
> can lead to an infinite OOM loop or system stall.
>
> Let's ignore memory.min of unpopulated cgroups.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Tejun Heo <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
I wouldn't mind merging this into the previous patch. It's fairly
small, and there is no reason to introduce an infinite OOM loop
scenario into the tree, even if it's just for one commit.
On Thu, May 03, 2018 at 01:38:35PM -0400, Johannes Weiner wrote:
> On Thu, May 03, 2018 at 12:43:58PM +0100, Roman Gushchin wrote:
> > If a cgroup has no associated tasks, invoking the OOM killer
> > won't help release any memory, so respecting the memory.min
> > can lead to an infinite OOM loop or system stall.
> >
> > Let's ignore memory.min of unpopulated cgroups.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Vladimir Davydov <[email protected]>
> > Cc: Tejun Heo <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> I wouldn't mind merging this into the previous patch. It's fairly
> small, and there is no reason to introduce an infinite OOM loop
> scenario into the tree, even if it's just for one commit.
OK, makes sense.
Here is an updated version: I've merged the two commits into one,
added a small note about empty cgroups to the docs, and rebased onto mm.
Andrew, can you please pull this one?
Thank you!
--
From 963b29d85a3c4624bc5fd167a645737c7cb54ed9 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Wed, 9 May 2018 18:43:34 +0100
Subject: [PATCH v4] mm: introduce memory.min
Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and
allows protecting working sets of important workloads from
sudden reclaim.
But its semantics has a significant limitation: it works
only as long as there is a supply of reclaimable memory.
This makes it pretty useless against any sort of slow memory
leaks or memory usage increases. This is especially true
for swapless systems. If swap is enabled, memory soft protection
effectively postpones problems, allowing a leaking application
to fill the entire swap area, which makes no sense.
The only effective way to guarantee the memory protection
in this case is to invoke the OOM killer.
It's possible to handle this case in userspace by reacting
to MEMCG_LOW events; but there is still a place for a fail-safe
in-kernel mechanism to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2
memory controller. It works very similarly to memory.low
(sharing the same hierarchical behavior), except that it's
not disabled if there is no more reclaimable memory in the system.
If a cgroup is not populated, its memory.min is ignored,
because otherwise even the OOM killer wouldn't be able
to reclaim the protected memory, and the system can stall.
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tejun Heo <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Documentation/cgroup-v2.txt | 27 +++++++++-
include/linux/memcontrol.h | 15 ++++--
include/linux/page_counter.h | 11 +++-
mm/memcontrol.c | 118 ++++++++++++++++++++++++++++++++++---------
mm/page_counter.c | 63 ++++++++++++++++-------
mm/vmscan.c | 18 ++++++-
6 files changed, 202 insertions(+), 50 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 657fe1769c75..1764a627a120 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1002,6 +1002,29 @@ PAGE_SIZE multiple when read back.
The total amount of memory currently being used by the cgroup
and its descendants.
+ memory.min
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ Hard memory protection. If the memory usage of a cgroup
+ is within its effective min boundary, the cgroup's memory
+ won't be reclaimed under any conditions. If there is no
+ unprotected reclaimable memory available, OOM killer
+ is invoked.
+
+ Effective low boundary is limited by memory.min values of
+ all ancestor cgroups. If there is memory.min overcommitment
+ (child cgroup or cgroups are requiring more protected memory
+ than parent will allow), then each child cgroup will get
+ the part of parent's protection proportional to its
+ actual memory usage below memory.min.
+
+ Putting more memory than generally available under this
+ protection is discouraged and may lead to constant OOMs.
+
+ If a memory cgroup is not populated with processes,
+ its memory.min is ignored.
+
memory.low
A read-write single value file which exists on non-root
cgroups. The default is "0".
@@ -1013,9 +1036,9 @@ PAGE_SIZE multiple when read back.
Effective low boundary is limited by memory.low values of
all ancestor cgroups. If there is memory.low overcommitment
- (child cgroup or cgroups are requiring more protected memory,
+ (child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
- the part of parent's protection proportional to the its
+ the part of parent's protection proportional to its
actual memory usage below memory.low.
Putting more memory than generally available under this
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 30d7b495b02d..c6d2ca859802 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,6 +60,12 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};
+enum mem_cgroup_protection {
+ MEMCG_PROT_NONE,
+ MEMCG_PROT_LOW,
+ MEMCG_PROT_MIN,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -298,7 +304,8 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -776,10 +783,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
{
}
-static inline bool mem_cgroup_low(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+static inline enum mem_cgroup_protection mem_cgroup_protected(
+ struct mem_cgroup *root, struct mem_cgroup *memcg)
{
- return false;
+ return MEMCG_PROT_NONE;
}
static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 7902a727d3b6..bab7e57f659b 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -8,10 +8,16 @@
struct page_counter {
atomic_long_t usage;
- unsigned long max;
+ unsigned long min;
unsigned long low;
+ unsigned long max;
struct page_counter *parent;
+ /* effective memory.min and memory.min usage tracking */
+ unsigned long emin;
+ atomic_long_t min_usage;
+ atomic_long_t children_min_usage;
+
/* effective memory.low and memory.low usage tracking */
unsigned long elow;
atomic_long_t low_usage;
@@ -47,8 +53,9 @@ bool page_counter_try_charge(struct page_counter *counter,
unsigned long nr_pages,
struct page_counter **fail);
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
int page_counter_memparse(const char *buf, const char *max,
unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb92c37c29d3..bdb8028c806c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4505,6 +4505,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg_offline_kmem(memcg);
@@ -4559,6 +4560,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_max(&memcg->memsw, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
@@ -5349,6 +5351,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
}
+static int memory_min_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long min = READ_ONCE(memcg->memory.min);
+
+ if (min == PAGE_COUNTER_MAX)
+ seq_puts(m, "max\n");
+ else
+ seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+ return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long min;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &min);
+ if (err)
+ return err;
+
+ page_counter_set_min(&memcg->memory, min);
+
+ return nbytes;
+}
+
static int memory_low_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5617,6 +5649,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+ {
+ .name = "min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_min_show,
+ .write = memory_min_write,
+ },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5674,19 +5712,24 @@ struct cgroup_subsys memory_cgrp_subsys = {
};
/**
- * mem_cgroup_low - check if memory consumption is in the normal range
+ * mem_cgroup_protected - check if memory consumption is in the normal range
* @root: the top ancestor of the sub-tree being checked
* @memcg: the memory cgroup to check
*
* WARNING: This function is not stateless! It can only be used as part
* of a top-down tree iteration, not for isolated queries.
*
- * Returns %true if memory consumption of @memcg is in the normal range.
+ * Returns one of the following:
+ * MEMCG_PROT_NONE: cgroup memory is not protected
+ * MEMCG_PROT_LOW: cgroup memory is protected as long there is
+ * an unprotected supply of reclaimable memory from other cgroups.
+ * MEMCG_PROT_MIN: cgroup memory is protected
*
- * @root is exclusive; it is never low when looked at directly
+ * @root is exclusive; it is never protected when looked at directly
*
- * To provide a proper hierarchical behavior, effective memory.low value
- * is used.
+ * To provide a proper hierarchical behavior, effective memory.min/low values
+ * are used. Below is the description of how effective memory.low is calculated.
+ * Effective memory.min values is calculated in the same way.
*
* Effective memory.low is always equal or less than the original memory.low.
* If there is no memory.low overcommittment (which is always true for
@@ -5731,51 +5774,78 @@ struct cgroup_subsys memory_cgrp_subsys = {
* E/memory.current = 0
*
* These calculations require constant tracking of the actual low usages
- * (see propagate_low_usage()), as well as recursive calculation of
- * effective memory.low values. But as we do call mem_cgroup_low()
+ * (see propagate_protected_usage()), as well as recursive calculation of
+ * effective memory.low values. But as we do call mem_cgroup_protected()
* path for each memory cgroup top-down from the reclaim,
* it's possible to optimize this part, and save calculated elow
* for next usage. This part is intentionally racy, but it's ok,
* as memory.low is a best-effort mechanism.
*/
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
- unsigned long usage, low_usage, siblings_low_usage;
- unsigned long elow, parent_elow;
struct mem_cgroup *parent;
+ unsigned long emin, parent_emin;
+ unsigned long elow, parent_elow;
+ unsigned long usage;
if (mem_cgroup_disabled())
- return false;
+ return MEMCG_PROT_NONE;
if (!root)
root = root_mem_cgroup;
if (memcg == root)
- return false;
+ return MEMCG_PROT_NONE;
- elow = memcg->memory.low;
usage = page_counter_read(&memcg->memory);
- parent = parent_mem_cgroup(memcg);
+ if (!usage)
+ return MEMCG_PROT_NONE;
+
+ emin = memcg->memory.min;
+ elow = memcg->memory.low;
+ parent = parent_mem_cgroup(memcg);
if (parent == root)
goto exit;
+ parent_emin = READ_ONCE(parent->memory.emin);
+ emin = min(emin, parent_emin);
+ if (emin && parent_emin) {
+ unsigned long min_usage, siblings_min_usage;
+
+ min_usage = min(usage, memcg->memory.min);
+ siblings_min_usage = atomic_long_read(
+ &parent->memory.children_min_usage);
+
+ if (min_usage && siblings_min_usage)
+ emin = min(emin, parent_emin * min_usage /
+ siblings_min_usage);
+ }
+
parent_elow = READ_ONCE(parent->memory.elow);
elow = min(elow, parent_elow);
+ if (elow && parent_elow) {
+ unsigned long low_usage, siblings_low_usage;
- if (!elow || !parent_elow)
- goto exit;
+ low_usage = min(usage, memcg->memory.low);
+ siblings_low_usage = atomic_long_read(
+ &parent->memory.children_low_usage);
- low_usage = min(usage, memcg->memory.low);
- siblings_low_usage = atomic_long_read(
- &parent->memory.children_low_usage);
-
- if (!low_usage || !siblings_low_usage)
- goto exit;
+ if (low_usage && siblings_low_usage)
+ elow = min(elow, parent_elow * low_usage /
+ siblings_low_usage);
+ }
- elow = min(elow, parent_elow * low_usage / siblings_low_usage);
exit:
+ memcg->memory.emin = emin;
memcg->memory.elow = elow;
- return usage && usage <= elow;
+
+ if (usage <= emin)
+ return MEMCG_PROT_MIN;
+ else if (usage <= elow)
+ return MEMCG_PROT_LOW;
+ else
+ return MEMCG_PROT_NONE;
}
/**
diff --git a/mm/page_counter.c b/mm/page_counter.c
index a5ff4cbc355a..de31470655f6 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,26 +13,38 @@
#include <linux/bug.h>
#include <asm/page.h>
-static void propagate_low_usage(struct page_counter *c, unsigned long usage)
+static void propagate_protected_usage(struct page_counter *c,
+ unsigned long usage)
{
- unsigned long low_usage, old;
+ unsigned long protected, old_protected;
long delta;
if (!c->parent)
return;
- if (!c->low && !atomic_long_read(&c->low_usage))
- return;
+ if (c->min || atomic_long_read(&c->min_usage)) {
+ if (usage <= c->min)
+ protected = usage;
+ else
+ protected = 0;
+
+ old_protected = atomic_long_xchg(&c->min_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_min_usage);
+ }
- if (usage <= c->low)
- low_usage = usage;
- else
- low_usage = 0;
+ if (c->low || atomic_long_read(&c->low_usage)) {
+ if (usage <= c->low)
+ protected = usage;
+ else
+ protected = 0;
- old = atomic_long_xchg(&c->low_usage, low_usage);
- delta = low_usage - old;
- if (delta)
- atomic_long_add(delta, &c->parent->children_low_usage);
+ old_protected = atomic_long_xchg(&c->low_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_low_usage);
+ }
}
/**
@@ -45,7 +57,7 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_sub_return(nr_pages, &counter->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/* More uncharges than charges? */
WARN_ON_ONCE(new < 0);
}
@@ -65,7 +77,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_add_return(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
@@ -109,7 +121,7 @@ bool page_counter_try_charge(struct page_counter *counter,
new = atomic_long_add_return(nr_pages, &c->usage);
if (new > c->max) {
atomic_long_sub(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is racy, but we can live with some
* inaccuracy in the failcnt.
@@ -118,7 +130,7 @@ bool page_counter_try_charge(struct page_counter *counter,
*fail = c;
goto failed;
}
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* Just like with failcnt, we can live with some
* inaccuracy in the watermark.
@@ -190,6 +202,23 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
}
}
+/**
+ * page_counter_set_min - set the amount of protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ counter->min = nr_pages;
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_set_low - set the amount of protected memory
* @counter: counter
@@ -204,7 +233,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
counter->low = nr_pages;
for (c = counter; c; c = c->parent)
- propagate_low_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
}
/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10c8a38c5eef..50055d72f294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2544,12 +2544,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
- if (mem_cgroup_low(root, memcg)) {
+ switch (mem_cgroup_protected(root, memcg)) {
+ case MEMCG_PROT_MIN:
+ /*
+ * Hard protection.
+ * If there is no reclaimable memory, OOM.
+ */
+ continue;
+ case MEMCG_PROT_LOW:
+ /*
+ * Soft protection.
+ * Respect the protection only as long as
+ * there is an unprotected supply
+ * of reclaimable memory from other cgroups.
+ */
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
+ break;
+ case MEMCG_PROT_NONE:
+ break;
}
reclaimed = sc->nr_reclaimed;
--
2.14.3
>
> Memory controller implements the memory.low best-effort memory
> protection mechanism, which works perfectly in many cases and
> allows protecting working sets of important workloads from
> sudden reclaim.
>
> But its semantics has a significant limitation: it works
> only as long as there is a supply of reclaimable memory.
> This makes it pretty useless against any sort of slow memory
> leaks or memory usage increases. This is especially true
> for swapless systems. If swap is enabled, memory soft protection
> effectively postpones problems, allowing a leaking application
> to fill all swap area, which makes no sense.
> The only effective way to guarantee the memory protection
> in this case is to invoke the OOM killer.
>
> It's possible to handle this case in userspace by reacting
> on MEMCG_LOW events; but there is still a place for a fail-safe
> in-kernel mechanism to provide stronger guarantees.
>
> This patch introduces the memory.min interface for cgroup v2
> memory controller. It works very similarly to memory.low
> (sharing the same hierarchical behavior), except that it's
> not disabled if there is no more reclaimable memory in the system.
>
> If cgroup is not populated, its memory.min is ignored,
> because otherwise even the OOM killer wouldn't be able
> to reclaim the protected memory, and the system can stall.
>
> ...
>
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1002,6 +1002,29 @@ PAGE_SIZE multiple when read back.
> The total amount of memory currently being used by the cgroup
> and its descendants.
>
> + memory.min
> + A read-write single value file which exists on non-root
> + cgroups. The default is "0".
> +
> + Hard memory protection. If the memory usage of a cgroup
> + is within its effective min boundary, the cgroup's memory
> + won't be reclaimed under any conditions. If there is no
> + unprotected reclaimable memory available, OOM killer
> + is invoked.
> +
> + Effective low boundary is limited by memory.min values of
> + all ancestor cgroups. If there is memory.min overcommitment
> + (child cgroup or cgroups are requiring more protected memory
> + than parent will allow), then each child cgroup will get
> + the part of parent's protection proportional to its
> + actual memory usage below memory.min.
> +
> + Putting more memory than generally available under this
> + protection is discouraged and may lead to constant OOMs.
> +
> + If a memory cgroup is not populated with processes,
> + its memory.min is ignored.
This is a copy-paste-edit of the memory.low description. Could we
please carefully check that it all remains accurate? Should "Effective
low boundary" be "Effective min boundary"? Does overcommit still apply
to .min? etcetera.
On Wed, May 09, 2018 at 03:38:05PM -0700, Andrew Morton wrote:
> >
> > Memory controller implements the memory.low best-effort memory
> > protection mechanism, which works perfectly in many cases and
> > allows protecting working sets of important workloads from
> > sudden reclaim.
> >
> > But its semantics has a significant limitation: it works
> > only as long as there is a supply of reclaimable memory.
> > This makes it pretty useless against any sort of slow memory
> > leaks or memory usage increases. This is especially true
> > for swapless systems. If swap is enabled, memory soft protection
> > effectively postpones problems, allowing a leaking application
> > to fill all swap area, which makes no sense.
> > The only effective way to guarantee the memory protection
> > in this case is to invoke the OOM killer.
> >
> > It's possible to handle this case in userspace by reacting
> > on MEMCG_LOW events; but there is still a place for a fail-safe
> > in-kernel mechanism to provide stronger guarantees.
> >
> > This patch introduces the memory.min interface for cgroup v2
> > memory controller. It works very similarly to memory.low
> > (sharing the same hierarchical behavior), except that it's
> > not disabled if there is no more reclaimable memory in the system.
> >
> > If cgroup is not populated, its memory.min is ignored,
> > because otherwise even the OOM killer wouldn't be able
> > to reclaim the protected memory, and the system can stall.
> >
> > ...
> >
> > --- a/Documentation/cgroup-v2.txt
> > +++ b/Documentation/cgroup-v2.txt
> > @@ -1002,6 +1002,29 @@ PAGE_SIZE multiple when read back.
> > The total amount of memory currently being used by the cgroup
> > and its descendants.
> >
> > + memory.min
> > + A read-write single value file which exists on non-root
> > + cgroups. The default is "0".
> > +
> > + Hard memory protection. If the memory usage of a cgroup
> > + is within its effective min boundary, the cgroup's memory
> > + won't be reclaimed under any conditions. If there is no
> > + unprotected reclaimable memory available, OOM killer
> > + is invoked.
> > +
> > + Effective low boundary is limited by memory.min values of
> > + all ancestor cgroups. If there is memory.min overcommitment
> > + (child cgroup or cgroups are requiring more protected memory
> > + than parent will allow), then each child cgroup will get
> > + the part of parent's protection proportional to its
> > + actual memory usage below memory.min.
> > +
> > + Putting more memory than generally available under this
> > + protection is discouraged and may lead to constant OOMs.
> > +
> > + If a memory cgroup is not populated with processes,
> > + its memory.min is ignored.
Hello, Andrew!
> This is a copy-paste-edit of the memory.low description. Could we
> please carefully check that it all remains accurate? Should "Effective
> low boundary" be "Effective min boundary"? Does overcommit still apply
> to .min? etcetera.
Except for this s/low/min replacement (good catch, thank you! diff below),
the rest looks fine to me. Memory.min and memory.low are similar
in their hierarchical behavior, so most of it still applies to .min.
Also, can you please add
Reviewed-by: Randy Dunlap <[email protected]>
(which was accidentally lost between versions).
Thank you!
--
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 1764a627a120..f6725628bb4f 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1012,7 +1012,7 @@ PAGE_SIZE multiple when read back.
unprotected reclaimable memory available, OOM killer
is invoked.
- Effective low boundary is limited by memory.min values of
+ Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
(child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
On Thu 03-05-18 12:43:57, Roman Gushchin wrote:
> Memory controller implements the memory.low best-effort memory
> protection mechanism, which works perfectly in many cases and
> allows protecting working sets of important workloads from
> sudden reclaim.
>
> But its semantics has a significant limitation: it works
> only as long as there is a supply of reclaimable memory.
> This makes it pretty useless against any sort of slow memory
> leaks or memory usage increases. This is especially true
> for swapless systems. If swap is enabled, memory soft protection
> effectively postpones problems, allowing a leaking application
> to fill all swap area, which makes no sense.
> The only effective way to guarantee the memory protection
> in this case is to invoke the OOM killer.
>
> It's possible to handle this case in userspace by reacting
> on MEMCG_LOW events; but there is still a place for a fail-safe
> in-kernel mechanism to provide stronger guarantees.
>
> This patch introduces the memory.min interface for cgroup v2
> memory controller. It works very similarly to memory.low
> (sharing the same hierarchical behavior), except that it's
> not disabled if there is no more reclaimable memory in the system.
Originally I was pushing for the hard guarantee before we landed with
the best effort one. The assumption back then was that properly
configured systems shouldn't see problems IIRC.
It is not entirely clear to me from the above description what the role of
the low limit is wrt. the leaking application, TBH. I presume you have a
process without any low&hard limit which leaks and basically breaks the
low limit expectation because of the lack of reclaimable memory and our
memcg_low_reclaim fallback.
If that is the case then the hard limit should indeed protect the
respective memcg from reclaim. But what is the actual guarantee?
We can reclaim that memory by the OOM killer, because there is no
protection from killing a memcg under the min limit. So what is the
actual semantic?
Also, how is an admin supposed to configure those limits? The low limit
doesn't really protect in some cases, so why should it be used at all?
I see how min matches max and low matches high, so there is a nice
symmetry but aren't we adding additional complexity to the API?
Isn't the real problem that the other guy (leaking application) doesn't
have any cap?
Please note I haven't looked at the implementation yet but I would like
to make sure I understand the whole concept first.
--
Michal Hocko
SUSE Labs
On Thu, May 10, 2018 at 03:30:03PM +0200, Michal Hocko wrote:
> On Thu 03-05-18 12:43:57, Roman Gushchin wrote:
> > Memory controller implements the memory.low best-effort memory
> > protection mechanism, which works perfectly in many cases and
> > allows protecting working sets of important workloads from
> > sudden reclaim.
> >
> > But its semantics has a significant limitation: it works
> > only as long as there is a supply of reclaimable memory.
> > This makes it pretty useless against any sort of slow memory
> > leaks or memory usage increases. This is especially true
> > for swapless systems. If swap is enabled, memory soft protection
> > effectively postpones problems, allowing a leaking application
> > to fill all swap area, which makes no sense.
> > The only effective way to guarantee the memory protection
> > in this case is to invoke the OOM killer.
> >
> > It's possible to handle this case in userspace by reacting
> > on MEMCG_LOW events; but there is still a place for a fail-safe
> > in-kernel mechanism to provide stronger guarantees.
> >
> > This patch introduces the memory.min interface for cgroup v2
> > memory controller. It works very similarly to memory.low
> > (sharing the same hierarchical behavior), except that it's
> > not disabled if there is no more reclaimable memory in the system.
>
> Originally I was pushing for the hard guarantee before we landed with
> the best effort one. The assumption back then was that properly
> configured systems shouldn't see problems IIRC.
Personally, I'm also not a big fan of the current memory.low semantics.
If you remember, my very first version of the memory guarantee (back in 2013)
implemented a hard approach: https://lwn.net/Articles/540240/
>
> It is not entirely clear to me what is the role of the low limit wrt.
> leaking application from the above description TBH. I presume you have a
> process without any low&hard limit which leaks and basically breaks the
> low limit expectation because of the lack of reclaimable memory and our
> memcg_low_reclaim fallback.
>
> If that is the case then the hard limit should indeed protect the
> respective memcg from reclaim. But what is the actuall guarantee?
> We can reclaim that memory by the OOM killer, because there is no
> protection from killing a memcg under the min limit. So what is the
> actual semantic?
If memcg memory usage is under its effective min boundary, its memory
won't be reclaimed.
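To make the hierarchical behavior concrete, here is a small userspace
model of the proportional calculation done by mem_cgroup_protected(),
with made-up numbers: a parent with an effective min of 200M and two
children that both ask for memory.min = 150M while using 100M and 60M
end up with effective protections of 125M and 75M respectively.

/* Illustrative model of the effective memory.min calculation;
 * all values are made-up numbers in megabytes. */
#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long effective_min(unsigned long memory_min,
				   unsigned long usage,
				   unsigned long parent_emin,
				   unsigned long siblings_min_usage)
{
	unsigned long emin = min_ul(memory_min, parent_emin);

	if (emin && parent_emin) {
		unsigned long min_usage = min_ul(usage, memory_min);

		if (min_usage && siblings_min_usage)
			emin = min_ul(emin, parent_emin * min_usage /
				      siblings_min_usage);
	}
	return emin;
}

int main(void)
{
	/* Parent emin = 200M; A uses 100M, B uses 60M, both ask for
	 * memory.min = 150M, so siblings_min_usage = 160M. */
	printf("A: %luM\n", effective_min(150, 100, 200, 160)); /* 125M */
	printf("B: %luM\n", effective_min(150, 60, 200, 160));  /*  75M */
	return 0;
}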
Making the OOM killer aware of memory guarantees is a separate topic
(and definitely a good idea to discuss!), but let's agree on
the simple fact that there are many workloads which prefer
to be killed rather than suffer from too high memory pressure.
>
> Also how is an admin supposed to configure those limits? low limit
> doesn't reall protect in some cases so why should it be used at all?
> I see how min matches max and low matches high, so there is a nice
> symmetry but aren't we adding additional complexity to the API?
> Isn't the real problem that the other guy (leaking application) doesn't
> have any cap?
My main point is that memory.low requires a userspace agent
to actually guarantee something. This agent is supposed to track
low memory events and somehow decrease memory pressure
if the memory.low watermark is reached (stop some workloads, for example).
This is not always handy, and having a strong guarantee makes sense, IMO.
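For reference, such an agent could be as simple as the sketch below
(purely illustrative: the cgroup path is hypothetical and the reaction
is left as a comment); a real agent would rather poll() memory.events
for notifications instead of sleeping in a loop:

/* Minimal sketch of a userspace agent watching MEMCG_LOW events.
 * The cgroup path is hypothetical; the reaction is workload-specific. */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_low_events(const char *events_path)
{
	char line[256];
	unsigned long long val = 0;
	FILE *f = fopen(events_path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "low %llu", &val) == 1)
			break;
	}
	fclose(f);
	return val;
}

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload/memory.events";
	unsigned long long seen = read_low_events(path);

	for (;;) {
		unsigned long long now = read_low_events(path);

		if (now > seen) {
			/* memory.low was breached: e.g. stop or throttle
			 * lower-priority workloads to relieve pressure. */
			fprintf(stderr, "low events: %llu\n", now);
			seen = now;
		}
		sleep(5);
	}
	return 0;
}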
We're experimenting with different setups, and the current approach
is to set memory.min to the minimal value which guarantees the normal
functioning of a workload, while memory.low can be set to a much higher
value, which sometimes brings some performance gains.
Thanks!