Some applications use memory cgroup limits to scale their own memory
needs. Reading of the immediate membership cgroup's memory.max is not
sufficient because of possible ancestral limits. The application could
traverse upwards to figure out the tightest limit but this would not
work in cgroup namespace where the view of cgroup hierarchy is
incomplete and the limit may apply from outer world.
Additionally, applications should respond to limit changes.
(cgroup v1 used memory.stat:hierarchical_memory_limit to report the
value but there's no such counterpart in cgroup v2 memory.stat.)
Introduce a new memcg attribute file that contains the effective value
of memory limit for given cgroup (following cpuset.cpus.effective
pattern) and that sends notifications like memory.events when the
effective limit changes.
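As a rough illustration of the consumer side, here is a userspace sketch of how an application might read such a file. It assumes the attribute uses the same format as memory.max (a byte count or the literal "max"). The helper names parse_memcg_limit() and read_memcg_limit() are made up for this example, and the path a caller passes (e.g. /sys/fs/cgroup/memory.max.effective) is an assumption that only holds with this series applied.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the text of a cgroup v2 limit file: either the literal "max"
 * (no limit) or a decimal byte count, optionally newline-terminated.
 * Returns 0 and stores the value (UINT64_MAX for "max"), -1 on bad input.
 * Hypothetical helper, not part of any existing library.
 */
static int parse_memcg_limit(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long val;

	if (strcmp(text, "max") == 0 || strcmp(text, "max\n") == 0) {
		*out = UINT64_MAX;
		return 0;
	}
	/* Reject empty input, signs and leading whitespace up front. */
	if (*text < '0' || *text > '9')
		return -1;
	errno = 0;
	val = strtoull(text, &end, 10);
	if (errno || (*end != '\0' && *end != '\n'))
		return -1;
	*out = val;
	return 0;
}

/*
 * Read a limit from @path. The caller supplies the path; the name
 * memory.max.effective is only available if this series is applied.
 */
static int read_memcg_limit(const char *path, uint64_t *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_memcg_limit(buf, out);
	fclose(f);
	return ret;
}
```

An application could size its heap from the returned value and treat UINT64_MAX as "no cgroup limit applies".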
Reasons for RFC:
1) Should global limit be included? (And respond to memory hotplug?)
2) Is swap.max.effective needed? (in v2 without memsw accounting)
3) Should memory.high be also handled?
4) What would be an alternative?
My answers to RFC:
1) No (there's no memory.max in global root memcg)
2) No (app doesn't have full control of memory that's swapped out)
3) No (scaling the allocator against the "soft" limit could end up in
dynamics that are difficult to reason about and administer)
4)
- PSI (too obscure for traditional users but better semantics for limit
shrinking)
- memory.stat field (like v1 but separate attribute is better for
notifications, cpuset precedent)
Changes from v4 (https://lore.kernel.org/r/[email protected])
- split the patch for swap.max.effective
- add Documentation/
- reword commit messages
- add notification support
Michal Koutný (3):
memcg: Add memory.max.effective attribute
memcg: Add memory.swap.max.effective like hierarchical_memsw_limit
memcg: Notify on memory.max.effective changes
Documentation/admin-guide/cgroup-v2.rst | 6 ++++
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 46 +++++++++++++++++++++++++
3 files changed, 54 insertions(+)
base-commit: 2df0193e62cf887f373995fb8a91068562784adc
--
2.45.1
cgroup v1 used memory.stat:hierarchical_memsw_limit to report the value
of the effective memsw limit. cgroup v2 has no combined charging, only the
swap.max limit. Add a new memcg attribute file that contains the effective
value of the swap limit for a given cgroup (following the
cpuset.cpus.effective pattern) for cases when the whole hierarchy cannot be
traversed upwards due to cgroupns visibility.
Signed-off-by: Jan Kratochvil (Azul) <[email protected]>
[ mkoutny: rewrite commit message, only memory.swap.max change]
Signed-off-by: Michal Koutný <[email protected]>
---
mm/memcontrol.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86bcec84fe7b..a889385f6033 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8279,6 +8279,19 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int swap_max_effective_show(struct seq_file *m, void *v)
+{
+	unsigned long swap;
+	struct mem_cgroup *mi;
+
+	/* Hierarchical information */
+	swap = PAGE_COUNTER_MAX;
+	for (mi = mem_cgroup_from_seq(m); mi; mi = parent_mem_cgroup(mi))
+		swap = min(swap, READ_ONCE(mi->swap.max));
+
+	return seq_puts_memcg_tunable(m, swap);
+}
+
static int swap_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -8311,6 +8324,11 @@ static struct cftype swap_files[] = {
.seq_show = swap_max_show,
.write = swap_max_write,
},
+	{
+		.name = "swap.max.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_max_effective_show,
+	},
{
.name = "swap.peak",
.flags = CFTYPE_NOT_ON_ROOT,
--
2.45.1
Some applications use memory cgroup limits to scale their own memory
needs. Reading of the immediate membership cgroup's memory.max is not
sufficient because of possible ancestral limits. The application could
traverse upwards to figure out the tightest limit but this would not
work in cgroup namespace where the view of cgroup hierarchy is
incomplete and the limit may apply from outer world.
(cgroup v1 used memory.stat:hierarchical_memory_limit to report the
value but there's no such counterpart in cgroup v2 memory.stat.)
Introduce a new memcg attribute file that contains the effective value
of memory limit for given cgroup (following cpuset.cpus.effective
pattern).
Signed-off-by: Jan Kratochvil (Azul) <[email protected]>
[ mkoutny: rewrite commit message, split out memory.swap.max]
Signed-off-by: Michal Koutný <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
mm/memcontrol.c | 18 ++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8fbb0519d556..988f26264054 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1293,6 +1293,12 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
+ memory.max.effective
+ A read-only file that provides effective value of cgroup's hard usage
+ limit. It incorporates limits of all ancestors, even those not visible
+ in cgroupns. The value change in this file generates a file modified
+ event.
+
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fad15b2290c..86bcec84fe7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7065,6 +7065,19 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_max_effective_show(struct seq_file *m, void *v)
+{
+	unsigned long memory;
+	struct mem_cgroup *mi;
+
+	/* Hierarchical information */
+	memory = PAGE_COUNTER_MAX;
+	for (mi = mem_cgroup_from_seq(m); mi; mi = parent_mem_cgroup(mi))
+		memory = min(memory, READ_ONCE(mi->memory.max));
+
+	return seq_puts_memcg_tunable(m, memory);
+}
+
/*
* Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
* if any new events become available.
@@ -7259,6 +7272,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_max_show,
.write = memory_max_write,
},
+	{
+		.name = "max.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_max_effective_show,
+	},
{
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
--
2.45.1
When users are interested in a cgroup's effective limit, they typically
respond to the value somehow and therefore they should be notified when
the value changes. Use the standard mechanism of triggering a
modification event on the respective cgroup file.
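For illustration only, a userspace sketch of how a consumer might wait for such a file modified event using inotify. The helper names limit_watch_open() and limit_watch_wait() are made up for this example. The path to watch is supplied by the caller, and pointing it at memory.max.effective assumes this series is applied.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching @path for modification; returns an inotify fd or -1. */
static int limit_watch_open(const char *path)
{
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, path, IN_MODIFY) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait up to @timeout_ms for a modification event on the watched file.
 * Returns 1 if the file was modified, 0 on timeout, -1 on error.
 */
static int limit_watch_wait(int fd, int timeout_ms)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	/* Drain queued events so that the next wait blocks again. */
	if (read(fd, buf, sizeof(buf)) < 0)
		return -1;
	return 1;
}
```

After a wakeup the consumer would re-read the limit file and rescale. The event only says that something changed, not what the new value is.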
Signed-off-by: Michal Koutný <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 10 ++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 030d34e9d117..79ecbbd87c4c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -232,6 +232,8 @@ struct mem_cgroup {
/* memory.events and memory.events.local */
struct cgroup_file events_file;
struct cgroup_file events_local_file;
+ /* memory.max.effective */
+ struct cgroup_file mem_max_file;
/* handle for "memory.swap.events" */
struct cgroup_file swap_events_file;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a889385f6033..72c8e4693506 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7022,6 +7022,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct mem_cgroup *iter;
unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
bool drained = false;
unsigned long max;
@@ -7061,6 +7062,14 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
break;
}
+	/*
+	 * Notification about limit tightening, not about coming OOMs, so it
+	 * can be after reclaim.
+	 */
+	for_each_mem_cgroup_tree(iter, memcg) {
+		cgroup_file_notify(&iter->mem_max_file);
+	}
+
+
memcg_wb_domain_size_changed(memcg);
return nbytes;
}
@@ -7275,6 +7284,7 @@ static struct cftype memory_files[] = {
{
.name = "max.effective",
.flags = CFTYPE_NOT_ON_ROOT,
+ .file_offset = offsetof(struct mem_cgroup, mem_max_file),
.seq_show = memory_max_effective_show,
},
{
--
2.45.1
On Thu, Jun 06, 2024 at 05:22:29PM +0200, Michal Koutný wrote:
> Some applications use memory cgroup limits to scale their own memory
> needs. Reading of the immediate membership cgroup's memory.max is not
> sufficient because of possible ancestral limits. The application could
> traverse upwards to figure out the tightest limit but this would not
> work in cgroup namespace where the view of cgroup hierarchy is
> incomplete and the limit may apply from outer world.
> Additionally, applications should respond to limit changes.
If the goal is to detect how much memory it would be possible to allocate,
I'm not sure that knowing all memory.max limits upper in the hierarchy
really buys anything without knowing actual usages and a potential
for memory reclaim across the entire tree.
E.g.:
A (max = 100G)
| \
B C
C's effective max will come out as 100G, but if B.anon_usage = 100G and
there is no swap, the actual number is 0.
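The difference between the two numbers can be sketched with a toy userspace model (not kernel code). struct cg_node, effective_max() and headroom() are names invented for this illustration. The model assumes usage is hierarchical (a parent's usage includes its children's) and that nothing is reclaimable, as in the no-swap anon case above.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of one cgroup; all fields are in bytes. */
struct cg_node {
	const struct cg_node *parent;
	uint64_t max;	/* memory.max, UINT64_MAX stands for "max" */
	uint64_t usage;	/* hierarchical usage, includes descendants */
};

/* Tightest memory.max on the path to the root. */
static uint64_t effective_max(const struct cg_node *cg)
{
	uint64_t lim = UINT64_MAX;

	for (; cg; cg = cg->parent)
		if (cg->max < lim)
			lim = cg->max;
	return lim;
}

/* Smallest (max - usage) on the path to the root, ignoring reclaim. */
static uint64_t headroom(const struct cg_node *cg)
{
	uint64_t room = UINT64_MAX;

	for (; cg; cg = cg->parent) {
		uint64_t r = cg->usage >= cg->max ? 0 : cg->max - cg->usage;

		if (r < room)
			room = r;
	}
	return room;
}
```

With A.max = 100G, B.usage = 100G (and so A.usage = 100G) and C idle, effective_max(C) is 100G while headroom(C) is 0.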
But if it's more about exploring the "invisible" part of the cgroup
tree configuration, it makes sense to me.
Not sure about the naming, maybe something like memory.tree.max
or memory.parent.max or even memory.hierarchical.max is a better fit.
Thanks!