2023-04-26 13:40:19

by Yosry Ahmed

Subject: [PATCH 0/2] memcg: OOM log improvements

This short patch series brings back some cgroup v1 stats in OOM logs,
and it makes memcg OOM logging less reliant on printk() internals.

The series uses seq_buf_do_printk() which was only recently introduced
[1]. It has not landed in Linus's tree yet, but ideally it will land this
merge window. I thought I would share the patches meanwhile for
feedback.

[1] https://lore.kernel.org/lkml/[email protected]/

Yosry Ahmed (2):
memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()
memcg: dump memory.stat during cgroup OOM for v1

mm/memcontrol.c | 85 ++++++++++++++++++++++++++++---------------------
1 file changed, 48 insertions(+), 37 deletions(-)

--
2.40.1.495.gc816e09b53d-goog


2023-04-26 13:41:06

by Yosry Ahmed

Subject: [PATCH 1/2] memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()

Currently, we format all the memcg stats into a buffer in
mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs.
However, this buffer is large in size. Although it is currently working
as intended, there is a dependency between the memcg stats buffer and the
printk record size limit.

If we add more stats in the future and the buffer becomes larger than
the printk record size limit, or if the printk record size limit is
reduced, the logs may be truncated.

It is safer to use seq_buf_do_printk(), which will automatically break
up the buffer at line breaks and issue small printk() calls.
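
As an illustration of the pattern (a minimal sketch, not taken from the
patch; the stat names and the anon_bytes/file_bytes values are made up),
a caller ends up doing something like this:

	struct seq_buf s;
	static char buf[PAGE_SIZE];

	seq_buf_init(&s, buf, sizeof(buf));
	/* format one newline-terminated stat per line */
	seq_buf_printf(&s, "anon %llu\n", anon_bytes);	/* anon_bytes: hypothetical */
	seq_buf_printf(&s, "file %llu\n", file_bytes);	/* file_bytes: hypothetical */
	/* emit one printk() per line instead of one huge record */
	seq_buf_do_printk(&s, KERN_INFO);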

Refactor the code to move the seq_buf from memory_stat_format() to its
callers, and use seq_buf_do_printk() to print the seq_buf in
mem_cgroup_print_oom_meminfo().

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..5922940f92c9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1551,13 +1551,10 @@ static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
}

-static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
+static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
- struct seq_buf s;
int i;

- seq_buf_init(&s, buf, bufsize);
-
/*
* Provide statistics on the state of the memory subsystem as
* well as cumulative event counters that show past behavior.
@@ -1574,21 +1571,21 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
u64 size;

size = memcg_page_state_output(memcg, memory_stats[i].idx);
- seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
+ seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);

if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
size += memcg_page_state_output(memcg,
NR_SLAB_RECLAIMABLE_B);
- seq_buf_printf(&s, "slab %llu\n", size);
+ seq_buf_printf(s, "slab %llu\n", size);
}
}

/* Accumulated memory events */
- seq_buf_printf(&s, "pgscan %lu\n",
+ seq_buf_printf(s, "pgscan %lu\n",
memcg_events(memcg, PGSCAN_KSWAPD) +
memcg_events(memcg, PGSCAN_DIRECT) +
memcg_events(memcg, PGSCAN_KHUGEPAGED));
- seq_buf_printf(&s, "pgsteal %lu\n",
+ seq_buf_printf(s, "pgsteal %lu\n",
memcg_events(memcg, PGSTEAL_KSWAPD) +
memcg_events(memcg, PGSTEAL_DIRECT) +
memcg_events(memcg, PGSTEAL_KHUGEPAGED));
@@ -1598,13 +1595,13 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
memcg_vm_event_stat[i] == PGPGOUT)
continue;

- seq_buf_printf(&s, "%s %lu\n",
+ seq_buf_printf(s, "%s %lu\n",
vm_event_name(memcg_vm_event_stat[i]),
memcg_events(memcg, memcg_vm_event_stat[i]));
}

/* The above should easily fit into one page */
- WARN_ON_ONCE(seq_buf_has_overflowed(&s));
+ WARN_ON_ONCE(seq_buf_has_overflowed(s));
}

#define K(x) ((x) << (PAGE_SHIFT-10))
@@ -1642,6 +1639,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
/* Use static buffer, for the caller is holding oom_lock. */
static char buf[PAGE_SIZE];
+ struct seq_buf s;

lockdep_assert_held(&oom_lock);

@@ -1664,8 +1662,9 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
pr_info("Memory cgroup stats for ");
pr_cont_cgroup_path(memcg->css.cgroup);
pr_cont(":");
- memory_stat_format(memcg, buf, sizeof(buf));
- pr_info("%s", buf);
+ seq_buf_init(&s, buf, sizeof(buf));
+ memory_stat_format(memcg, &s);
+ seq_buf_do_printk(&s, KERN_INFO);
}

/*
@@ -6573,10 +6572,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ struct seq_buf s;

if (!buf)
return -ENOMEM;
- memory_stat_format(memcg, buf, PAGE_SIZE);
+ seq_buf_init(&s, buf, PAGE_SIZE);
+ memory_stat_format(memcg, &s);
seq_puts(m, buf);
kfree(buf);
return 0;
--
2.40.1.495.gc816e09b53d-goog

2023-04-26 14:10:34

by Yosry Ahmed

Subject: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup
OOM") made sure we dump all the stats in memory.stat during a cgroup
OOM, but it also introduced a slight behavioral change. The code used to
print the non-hierarchical v1 cgroup stats for the entire cgroup
subtree; now it only prints the v2 cgroup stats for the cgroup under
OOM.

Although v2 stats are a superset of v1 stats, some of them have
different naming. We also lost the non-hierarchical stats for the cgroup
under OOM in v1.

Move the v2 specific formatting from memory_stat_format() to
memcg_stat_format(), add memcg1_stat_format() for v1, and make
memory_stat_format() select between them based on cgroup version.
Since memory_stat_show() now works for both v1 & v2, drop
memcg_stat_show().
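
For reference, the resulting selection logic is tiny; the sketch below
mirrors the hunk further down, with memory_stat_format() simply picking
the right formatter for the cgroup version:

	static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
	{
		if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
			memcg_stat_format(memcg, s);	/* cgroup v2 */
		else
			memcg1_stat_format(memcg, s);	/* cgroup v1 */
		WARN_ON_ONCE(seq_buf_has_overflowed(s));
	}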

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 60 ++++++++++++++++++++++++++++---------------------
1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5922940f92c9..2b492f8d540c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1551,7 +1551,7 @@ static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
}

-static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
+static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
int i;

@@ -1604,6 +1604,17 @@ static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
WARN_ON_ONCE(seq_buf_has_overflowed(s));
}

+static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+
+static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
+{
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ memcg_stat_format(memcg, s);
+ else
+ memcg1_stat_format(memcg, s);
+ WARN_ON_ONCE(seq_buf_has_overflowed(s));
+}
+
#define K(x) ((x) << (PAGE_SHIFT-10))
/**
* mem_cgroup_print_oom_context: Print OOM information relevant to
@@ -4078,9 +4089,8 @@ static const unsigned int memcg1_events[] = {
PGMAJFAULT,
};

-static int memcg_stat_show(struct seq_file *m, void *v)
+static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
- struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
unsigned long memory, memsw;
struct mem_cgroup *mi;
unsigned int i;
@@ -4095,18 +4105,18 @@ static int memcg_stat_show(struct seq_file *m, void *v)
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
nr = memcg_page_state_local(memcg, memcg1_stats[i]);
- seq_printf(m, "%s %lu\n", memcg1_stat_names[i],
+ seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i],
nr * memcg_page_state_unit(memcg1_stats[i]));
}

for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
- seq_printf(m, "%s %lu\n", vm_event_name(memcg1_events[i]),
- memcg_events_local(memcg, memcg1_events[i]));
+ seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
+ memcg_events_local(memcg, memcg1_events[i]));

for (i = 0; i < NR_LRU_LISTS; i++)
- seq_printf(m, "%s %lu\n", lru_list_name(i),
- memcg_page_state_local(memcg, NR_LRU_BASE + i) *
- PAGE_SIZE);
+ seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
+ memcg_page_state_local(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);

/* Hierarchical information */
memory = memsw = PAGE_COUNTER_MAX;
@@ -4114,11 +4124,11 @@ static int memcg_stat_show(struct seq_file *m, void *v)
memory = min(memory, READ_ONCE(mi->memory.max));
memsw = min(memsw, READ_ONCE(mi->memsw.max));
}
- seq_printf(m, "hierarchical_memory_limit %llu\n",
- (u64)memory * PAGE_SIZE);
+ seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
+ (u64)memory * PAGE_SIZE);
if (do_memsw_account())
- seq_printf(m, "hierarchical_memsw_limit %llu\n",
- (u64)memsw * PAGE_SIZE);
+ seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
+ (u64)memsw * PAGE_SIZE);

for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
@@ -4126,19 +4136,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
nr = memcg_page_state(memcg, memcg1_stats[i]);
- seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
+ seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
(u64)nr * memcg_page_state_unit(memcg1_stats[i]));
}

for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
- seq_printf(m, "total_%s %llu\n",
- vm_event_name(memcg1_events[i]),
- (u64)memcg_events(memcg, memcg1_events[i]));
+ seq_buf_printf(s, "total_%s %llu\n",
+ vm_event_name(memcg1_events[i]),
+ (u64)memcg_events(memcg, memcg1_events[i]));

for (i = 0; i < NR_LRU_LISTS; i++)
- seq_printf(m, "total_%s %llu\n", lru_list_name(i),
- (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
- PAGE_SIZE);
+ seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
+ (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);

#ifdef CONFIG_DEBUG_VM
{
@@ -4153,12 +4163,10 @@ static int memcg_stat_show(struct seq_file *m, void *v)
anon_cost += mz->lruvec.anon_cost;
file_cost += mz->lruvec.file_cost;
}
- seq_printf(m, "anon_cost %lu\n", anon_cost);
- seq_printf(m, "file_cost %lu\n", file_cost);
+ seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
+ seq_buf_printf(s, "file_cost %lu\n", file_cost);
}
#endif
-
- return 0;
}

static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
@@ -4998,6 +5006,8 @@ static int mem_cgroup_slab_show(struct seq_file *m, void *p)
}
#endif

+static int memory_stat_show(struct seq_file *m, void *v);
+
static struct cftype mem_cgroup_legacy_files[] = {
{
.name = "usage_in_bytes",
@@ -5030,7 +5040,7 @@ static struct cftype mem_cgroup_legacy_files[] = {
},
{
.name = "stat",
- .seq_show = memcg_stat_show,
+ .seq_show = memory_stat_show,
},
{
.name = "force_empty",
--
2.40.1.495.gc816e09b53d-goog

2023-04-26 14:20:57

by Steven Rostedt

Subject: Re: [PATCH 0/2] memcg: OOM log improvements

On Wed, 26 Apr 2023 13:39:17 +0000
Yosry Ahmed <[email protected]> wrote:

> The series uses seq_buf_do_printk() which was only recently introduced
> [1]. It did not land in Linus's tree yet, but ideally it will land this
> merge window. I thought I would share the patches meanwhile for
> feedback.
>
> [1]https://lore.kernel.org/lkml/[email protected]/

FYI, it's running through my tests right now (with several other patches).

If everything passes, I'll let it sit in linux-next for a day or two then
push to Linus at the end of the week.

-- Steve

2023-04-26 14:31:20

by Yosry Ahmed

Subject: Re: [PATCH 0/2] memcg: OOM log improvements

On Wed, Apr 26, 2023 at 7:19 AM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 26 Apr 2023 13:39:17 +0000
> Yosry Ahmed <[email protected]> wrote:
>
> > The series uses seq_buf_do_printk() which was only recently introduced
> > [1]. It did not land in Linus's tree yet, but ideally it will land this
> > merge window. I thought I would share the patches meanwhile for
> > feedback.
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
>
> FYI, it's running through my tests right now (with several other patches).
>
> If everything passes, I'll let it sit in linux-next for a day or two then
> push to Linus at the end of the week.
>
> -- Steve

Great! Thanks for the update!

2023-04-26 15:27:39

by Michal Hocko

Subject: Re: [PATCH 1/2] memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()

On Wed 26-04-23 13:39:18, Yosry Ahmed wrote:
> Currently, we format all the memcg stats into a buffer in
> mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs.
> However, this buffer is large in size. Although it is currently working
> as intended, there is a dependency between the memcg stats buffer and the
> printk record size limit.
>
> If we add more stats in the future and the buffer becomes larger than
> the printk record size limit, or if the printk record size limit is
> reduced, the logs may be truncated.
>
> It is safer to use seq_buf_do_printk(), which will automatically break
> up the buffer at line breaks and issue small printk() calls.
>
> Refactor the code to move the seq_buf from memory_stat_format() to its
> callers, and use seq_buf_do_printk() to print the seq_buf in
> mem_cgroup_print_oom_meminfo().
>
> Signed-off-by: Yosry Ahmed <[email protected]>

No objection from me, but is it possible that more printk calls (one per
line with this change, correct?) would add contention on the printk
path?

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/memcontrol.c | 27 ++++++++++++++-------------
> 1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5abffe6f8389..5922940f92c9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1551,13 +1551,10 @@ static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
> return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
> }
>
> -static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
> +static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> {
> - struct seq_buf s;
> int i;
>
> - seq_buf_init(&s, buf, bufsize);
> -
> /*
> * Provide statistics on the state of the memory subsystem as
> * well as cumulative event counters that show past behavior.
> @@ -1574,21 +1571,21 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
> u64 size;
>
> size = memcg_page_state_output(memcg, memory_stats[i].idx);
> - seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
> + seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
>
> if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> size += memcg_page_state_output(memcg,
> NR_SLAB_RECLAIMABLE_B);
> - seq_buf_printf(&s, "slab %llu\n", size);
> + seq_buf_printf(s, "slab %llu\n", size);
> }
> }
>
> /* Accumulated memory events */
> - seq_buf_printf(&s, "pgscan %lu\n",
> + seq_buf_printf(s, "pgscan %lu\n",
> memcg_events(memcg, PGSCAN_KSWAPD) +
> memcg_events(memcg, PGSCAN_DIRECT) +
> memcg_events(memcg, PGSCAN_KHUGEPAGED));
> - seq_buf_printf(&s, "pgsteal %lu\n",
> + seq_buf_printf(s, "pgsteal %lu\n",
> memcg_events(memcg, PGSTEAL_KSWAPD) +
> memcg_events(memcg, PGSTEAL_DIRECT) +
> memcg_events(memcg, PGSTEAL_KHUGEPAGED));
> @@ -1598,13 +1595,13 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
> memcg_vm_event_stat[i] == PGPGOUT)
> continue;
>
> - seq_buf_printf(&s, "%s %lu\n",
> + seq_buf_printf(s, "%s %lu\n",
> vm_event_name(memcg_vm_event_stat[i]),
> memcg_events(memcg, memcg_vm_event_stat[i]));
> }
>
> /* The above should easily fit into one page */
> - WARN_ON_ONCE(seq_buf_has_overflowed(&s));
> + WARN_ON_ONCE(seq_buf_has_overflowed(s));
> }
>
> #define K(x) ((x) << (PAGE_SHIFT-10))
> @@ -1642,6 +1639,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> {
> /* Use static buffer, for the caller is holding oom_lock. */
> static char buf[PAGE_SIZE];
> + struct seq_buf s;
>
> lockdep_assert_held(&oom_lock);
>
> @@ -1664,8 +1662,9 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> pr_info("Memory cgroup stats for ");
> pr_cont_cgroup_path(memcg->css.cgroup);
> pr_cont(":");
> - memory_stat_format(memcg, buf, sizeof(buf));
> - pr_info("%s", buf);
> + seq_buf_init(&s, buf, sizeof(buf));
> + memory_stat_format(memcg, &s);
> + seq_buf_do_printk(&s, KERN_INFO);
> }
>
> /*
> @@ -6573,10 +6572,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + struct seq_buf s;
>
> if (!buf)
> return -ENOMEM;
> - memory_stat_format(memcg, buf, PAGE_SIZE);
> + seq_buf_init(&s, buf, PAGE_SIZE);
> + memory_stat_format(memcg, &s);
> seq_puts(m, buf);
> kfree(buf);
> return 0;
> --
> 2.40.1.495.gc816e09b53d-goog

--
Michal Hocko
SUSE Labs

2023-04-26 15:35:37

by Michal Hocko

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Wed 26-04-23 13:39:19, Yosry Ahmed wrote:
> Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup
> OOM") made sure we dump all the stats in memory.stat during a cgroup
> OOM, but it also introduced a slight behavioral change. The code used to
> print the non-hierarchical v1 cgroup stats for the entire cgroup
> subtree; now it only prints the v2 cgroup stats for the cgroup under
> OOM.
>
> Although v2 stats are a superset of v1 stats, some of them have
> different naming. We also lost the non-hierarchical stats for the cgroup
> under OOM in v1.

Why is that a problem worth solving? It would be also nice to add an
example of the oom report before and after the patch.
--
Michal Hocko
SUSE Labs

2023-04-27 00:47:07

by Sergey Senozhatsky

Subject: Re: [PATCH 1/2] memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()

On (23/04/26 13:39), Yosry Ahmed wrote:
> Currently, we format all the memcg stats into a buffer in
> mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs.
> However, this buffer is large in size. Although it is currently working
> as intended, there is a dependency between the memcg stats buffer and the
> printk record size limit.
>
> If we add more stats in the future and the buffer becomes larger than
> the printk record size limit, or if the printk record size limit is
> reduced, the logs may be truncated.
>
> It is safer to use seq_buf_do_printk(), which will automatically break
> up the buffer at line breaks and issue small printk() calls.
>
> Refactor the code to move the seq_buf from memory_stat_format() to its
> callers, and use seq_buf_do_printk() to print the seq_buf in
> mem_cgroup_print_oom_meminfo().
>
> Signed-off-by: Yosry Ahmed <[email protected]>

FWIW,
Reviewed-by: Sergey Senozhatsky <[email protected]>

2023-04-27 01:02:43

by Sergey Senozhatsky

Subject: Re: [PATCH 1/2] memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()

On (23/04/26 17:24), Michal Hocko wrote:
> No objection from me but is it possible that more printk calls (one per
> line with this change correct?) would add a contention on the printk
> path?

It will probably have the opposite effect: console->write of longer lines
keeps local IRQs disabled longer and keeps the console_waiter printk
spinning (in console_trylock_spinning()) longer.

2023-04-27 09:22:49

by Yosry Ahmed

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Wed, Apr 26, 2023 at 8:27 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 26-04-23 13:39:19, Yosry Ahmed wrote:
> > Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup
> > OOM") made sure we dump all the stats in memory.stat during a cgroup
> > OOM, but it also introduced a slight behavioral change. The code used to
> > print the non-hierarchical v1 cgroup stats for the entire cgroup
> > subtree; now it only prints the v2 cgroup stats for the cgroup under
> > OOM.
> >
> > Although v2 stats are a superset of v1 stats, some of them have
> > different naming. We also lost the non-hierarchical stats for the cgroup
> > under OOM in v1.
>
> Why is that a problem worth solving? It would be also nice to add an
> example of the oom report before and after the patch.
> --
> Michal Hocko
> SUSE Labs

Thanks for taking a look!

The problem is that when upgrading to a kernel that contains
c8713d0b2312 on cgroup v1, the OOM logs suddenly change. The stats
names become different, a couple of stats are gone, and the
non-hierarchical stats disappear.

The non-hierarchical stats are important to identify if a memcg OOM'd
because of the memory consumption of its own processes or its
descendants. In the example below, I created a parent memcg "a", and a
child memcg "b". A process in "a" itself ("tail" in this case) is
hogging memory and causing an OOM, not the processes in the child "b"
(the "sleep" processes). With non-hierarchical stats, it's clear that
this is the case.

Also, it is generally nice to keep things consistent as much as
possible. The sudden change of the OOM log with the kernel upgrade is
confusing, especially since the memcg stats in the OOM logs in cgroup
v1 now look different from the stats in memory.stat. This patch
restores the consistency for cgroup v1, without affecting cgroup v2.
IMO, it's also a nice cleanup to have the stats formatting code be
consistent across cgroup v1 and v2. I personally didn't like the
memory_stat_format() vs. memcg_stat_show() distinction.

Here is a sample of the OOM logs from the scenario described above:

Before:
[ 88.339330] memory: usage 10240kB, limit 10240kB, failcnt 54
[ 88.339340] memory+swap: usage 10240kB, limit 9007199254740988kB, failcnt 0
[ 88.339347] kmem: usage 552kB, limit 9007199254740988kB, failcnt 0
[ 88.339348] Memory cgroup stats for /a:
[ 88.339458] anon 9900032
[ 88.339483] file 0
[ 88.339483] kernel 565248
[ 88.339484] kernel_stack 0
[ 88.339485] pagetables 294912
[ 88.339486] sec_pagetables 0
[ 88.339486] percpu 15584
[ 88.339487] sock 0
[ 88.339487] vmalloc 0
[ 88.339488] shmem 0
[ 88.339488] zswap 0
[ 88.339489] zswapped 0
[ 88.339489] file_mapped 0
[ 88.339490] file_dirty 0
[ 88.339490] file_writeback 0
[ 88.339491] swapcached 0
[ 88.339491] anon_thp 2097152
[ 88.339492] file_thp 0
[ 88.339492] shmem_thp 0
[ 88.339497] inactive_anon 9797632
[ 88.339498] active_anon 45056
[ 88.339498] inactive_file 0
[ 88.339499] active_file 0
[ 88.339499] unevictable 0
[ 88.339500] slab_reclaimable 19888
[ 88.339500] slab_unreclaimable 42752
[ 88.339501] slab 62640
[ 88.339501] workingset_refault_anon 0
[ 88.339502] workingset_refault_file 0
[ 88.339502] workingset_activate_anon 0
[ 88.339503] workingset_activate_file 0
[ 88.339503] workingset_restore_anon 0
[ 88.339504] workingset_restore_file 0
[ 88.339504] workingset_nodereclaim 0
[ 88.339505] pgscan 0
[ 88.339505] pgsteal 0
[ 88.339506] pgscan_kswapd 0
[ 88.339506] pgscan_direct 0
[ 88.339507] pgscan_khugepaged 0
[ 88.339507] pgsteal_kswapd 0
[ 88.339508] pgsteal_direct 0
[ 88.339508] pgsteal_khugepaged 0
[ 88.339509] pgfault 2750
[ 88.339509] pgmajfault 0
[ 88.339510] pgrefill 0
[ 88.339510] pgactivate 1
[ 88.339511] pgdeactivate 0
[ 88.339511] pglazyfree 0
[ 88.339512] pglazyfreed 0
[ 88.339512] zswpin 0
[ 88.339513] zswpout 0
[ 88.339513] thp_fault_alloc 0
[ 88.339514] thp_collapse_alloc 1
[ 88.339514] Tasks state (memory values in pages):
[ 88.339515] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 88.339516] [ 108] 0 108 2986 2624 61440 0 0 tail
[ 88.339525] [ 97] 0 97 724 352 32768 0 0 sleep
[ 88.339538] [ 99] 0 99 724 352 32768 0 0 sleep
[ 88.339541] [ 98] 0 98 724 320 32768 0 0 sleep
[ 88.339542] [ 101] 0 101 724 320 32768 0 0 sleep
[ 88.339544] [ 102] 0 102 724 352 32768 0 0 sleep
[ 88.339546] [ 103] 0 103 724 352 32768 0 0 sleep
[ 88.339548] [ 104] 0 104 724 352 32768 0 0 sleep
[ 88.339549] [ 105] 0 105 724 352 32768 0 0 sleep
[ 88.339551] [ 100] 0 100 724 352 32768 0 0 sleep
[ 88.339558] [ 106] 0 106 724 352 32768 0 0 sleep
[ 88.339563] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/a,task_memcg=/a,task=tail,pid=108,uid=0
[ 88.339588] Memory cgroup out of memory: Killed process 108 (tail) total-vm:11944kB, anon-rss:9216kB, file-rss:0kB, shmem-rss:1280kB, UID:00


After:
[ 74.447997] memory: usage 10240kB, limit 10240kB, failcnt 116
[ 74.447998] memory+swap: usage 10240kB, limit 9007199254740988kB, failcnt 0
[ 74.448000] kmem: usage 548kB, limit 9007199254740988kB, failcnt 0
[ 74.448001] Memory cgroup stats for /a:
[ 74.448103] cache 0
[ 74.448104] rss 9433088
[ 74.448105] rss_huge 2097152
[ 74.448105] shmem 0
[ 74.448106] mapped_file 0
[ 74.448106] dirty 0
[ 74.448107] writeback 0
[ 74.448107] workingset_refault_anon 0
[ 74.448108] workingset_refault_file 0
[ 74.448109] swap 0
[ 74.448109] pgpgin 2304
[ 74.448110] pgpgout 512
[ 74.448111] pgfault 2332
[ 74.448111] pgmajfault 0
[ 74.448112] inactive_anon 9388032
[ 74.448112] active_anon 4096
[ 74.448113] inactive_file 0
[ 74.448113] active_file 0
[ 74.448114] unevictable 0
[ 74.448114] hierarchical_memory_limit 10485760
[ 74.448115] hierarchical_memsw_limit 9223372036854771712
[ 74.448116] total_cache 0
[ 74.448116] total_rss 9818112
[ 74.448117] total_rss_huge 2097152
[ 74.448118] total_shmem 0
[ 74.448118] total_mapped_file 0
[ 74.448119] total_dirty 0
[ 74.448119] total_writeback 0
[ 74.448120] total_workingset_refault_anon 0
[ 74.448120] total_workingset_refault_file 0
[ 74.448121] total_swap 0
[ 74.448121] total_pgpgin 2407
[ 74.448121] total_pgpgout 521
[ 74.448122] total_pgfault 2734
[ 74.448122] total_pgmajfault 0
[ 74.448123] total_inactive_anon 9715712
[ 74.448123] total_active_anon 45056
[ 74.448124] total_inactive_file 0
[ 74.448124] total_active_file 0
[ 74.448125] total_unevictable 0
[ 74.448125] Tasks state (memory values in pages):
[ 74.448126] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 74.448127] [ 107] 0 107 2982 2592 61440 0 0 tail
[ 74.448131] [ 97] 0 97 724 352 32768 0 0 sleep
[ 74.448134] [ 98] 0 98 724 352 32768 0 0 sleep
[ 74.448136] [ 99] 0 99 724 352 32768 0 0 sleep
[ 74.448137] [ 101] 0 101 724 352 32768 0 0 sleep
[ 74.448139] [ 102] 0 102 724 352 32768 0 0 sleep
[ 74.448141] [ 103] 0 103 724 352 28672 0 0 sleep
[ 74.448143] [ 104] 0 104 724 352 32768 0 0 sleep
[ 74.448144] [ 105] 0 105 724 352 32768 0 0 sleep
[ 74.448146] [ 106] 0 106 724 352 32768 0 0 sleep
[ 74.448148] [ 100] 0 100 724 352 32768 0 0 sleep
[ 74.448155] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/a,task_memcg=/a,task=tail,pid=107,uid=0
[ 74.448178] Memory cgroup out of memory: Killed process 107 (tail) total-vm:11928kB, anon-rss:9088kB, file-rss:0kB, shmem-rss:1280kB, UID:00

2023-04-27 14:09:10

by Michal Hocko

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Thu 27-04-23 02:21:30, Yosry Ahmed wrote:
> On Wed, Apr 26, 2023 at 8:27 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 26-04-23 13:39:19, Yosry Ahmed wrote:
> > > Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup
> > > OOM") made sure we dump all the stats in memory.stat during a cgroup
> > > OOM, but it also introduced a slight behavioral change. The code used to
> > > print the non-hierarchical v1 cgroup stats for the entire cgroup
> > > subtree; now it only prints the v2 cgroup stats for the cgroup under
> > > OOM.
> > >
> > > Although v2 stats are a superset of v1 stats, some of them have
> > > different naming. We also lost the non-hierarchical stats for the cgroup
> > > under OOM in v1.
> >
> > Why is that a problem worth solving? It would be also nice to add an
> > example of the oom report before and after the patch.
> > --
> > Michal Hocko
> > SUSE Labs
>
> Thanks for taking a look!
>
> The problem is that when upgrading to a kernel that contains
> c8713d0b2312 on cgroup v1, the OOM logs suddenly change. The stats
> names become different, a couple of stats are gone, and the
> non-hierarchical stats disappear.
>
> The non-hierarchical stats are important to identify if a memcg OOM'd
> because of the memory consumption of its own processes or its
> descendants. In the example below, I created a parent memcg "a", and a
> child memcg "b". A process in "a" itself ("tail" in this case) is
> hogging memory and causing an OOM, not the processes in the child "b"
> (the "sleep" processes). With non-hierarchical stats, it's clear that
> this is the case.

Is this difference really important from the OOM POV? There is no group
oom semantic in v1 and so it always boils down to a specific process
that gets selected. Which memcg it is sitting in shouldn't matter all
that much. Or does it really matter?

> Also, it is generally nice to keep things consistent as much as
> possible. The sudden change of the OOM log with the kernel upgrade is
> confusing, especially that the memcg stats in the OOM logs in cgroup
> v1 now look different from the stats in memory.stat.

Generally speaking, the oom report is not carved in stone. While we
shouldn't make changes willy-nilly, it might change for
implementation-specific reasons.

In this particular case I would agree that the new output is more
confusing than helpful. Just look at
> [ 88.339505] pgscan 0
> [ 88.339505] pgsteal 0
> [ 88.339506] pgscan_kswapd 0
> [ 88.339506] pgscan_direct 0
> [ 88.339507] pgscan_khugepaged 0
> [ 88.339507] pgsteal_kswapd 0
> [ 88.339508] pgsteal_direct 0
> [ 88.339508] pgsteal_khugepaged 0

These stats are actively misleading because they would suggest there was
no memory reclaim done before oom was hit, and that would imply a
potentially premature OOM killer invocation (thus a bug). There are
likely other stats which are not tracked in v1 yet are reported, which
might add to the confusion. I believe this would be a sound
justification to get back to the original reporting.
--
Michal Hocko
SUSE Labs

2023-04-27 22:18:29

by Yosry Ahmed

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Thu, Apr 27, 2023 at 7:06 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 27-04-23 02:21:30, Yosry Ahmed wrote:
> > On Wed, Apr 26, 2023 at 8:27 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Wed 26-04-23 13:39:19, Yosry Ahmed wrote:
> > > > Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup
> > > > OOM") made sure we dump all the stats in memory.stat during a cgroup
> > > > OOM, but it also introduced a slight behavioral change. The code used to
> > > > print the non-hierarchical v1 cgroup stats for the entire cgroup
> > > > subtree; now it only prints the v2 cgroup stats for the cgroup under
> > > > OOM.
> > > >
> > > > Although v2 stats are a superset of v1 stats, some of them have
> > > > different naming. We also lost the non-hierarchical stats for the cgroup
> > > > under OOM in v1.
> > >
> > > Why is that a problem worth solving? It would be also nice to add an
> > > example of the oom report before and after the patch.
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > Thanks for taking a look!
> >
> > The problem is that when upgrading to a kernel that contains
> > c8713d0b2312 on cgroup v1, the OOM logs suddenly change. The stats
> > names become different, a couple of stats are gone, and the
> > non-hierarchical stats disappear.
> >
> > The non-hierarchical stats are important to identify if a memcg OOM'd
> > because of the memory consumption of its own processes or its
> > descendants. In the example below, I created a parent memcg "a", and a
> > child memcg "b". A process in "a" itself ("tail" in this case) is
> > hogging memory and causing an OOM, not the processes in the child "b"
> > (the "sleep" processes). With non-hierarchical stats, it's clear that
> > this is the case.
>
> Is this difference really important from the OOM POV. There is no group
> oom semantic in v1 and so it always boils down to a specific process
> that gets selected. Which memcg it is sitting in shouldn't matter all
> that much. Or does it really matter?

It is useful information that we already have and used to dump. It's
not really about which memcg the victim lived in, but which memcg is
not behaving as expected and causing an OOM. If you have processes
running in the OOMing memcg itself as well as in its descendants, then
it's nice to get an idea of which memcg's usage is not as expected.

>
> > Also, it is generally nice to keep things consistent as much as
> > possible. The sudden change of the OOM log with the kernel upgrade is
> > confusing, especially that the memcg stats in the OOM logs in cgroup
> > v1 now look different from the stats in memory.stat.
>
> Generally speaking oom report is not carved into stone. While we
> shouldn't make changes just nilly willy it might change for
> implementation specific reasons.

Agreed, but for this case there is really no reason to change the
behavior, and IMO restoring the behavior makes the code cleaner and
more consistent.

>
> In this particular case I would agree that the new output is more
> confusing than helpful. Just look at
> > [ 88.339505] pgscan 0
> > [ 88.339505] pgsteal 0
> > [ 88.339506] pgscan_kswapd 0
> > [ 88.339506] pgscan_direct 0
> > [ 88.339507] pgscan_khugepaged 0
> > [ 88.339507] pgsteal_kswapd 0
> > [ 88.339508] pgsteal_direct 0
> > [ 88.339508] pgsteal_khugepaged 0
>
> These stats are actively misleading because it would suggest there was
> no memory reclaim done before oom was hit and that would imply a
> potentially premature OOM killer invocation (thus a bug). There are
> likely other stats which are not tracked in v1 yet they are reported
> that might add to the confusion. I believe this would be a sound
> justification to get back to the original reporting.

I don't think the collection of these stats is v2-specific; I was just
testing in a VM without swap, and the process had 100% anonymous
memory. Testing with swap gives sane stats:
[ 68.851724] pgscan 248801
[ 68.851725] pgsteal 123081
[ 68.851725] pgscan_kswapd 0
[ 68.851726] pgscan_direct 248801
[ 68.851727] pgscan_khugepaged 0
[ 68.851727] pgsteal_kswapd 0
[ 68.851728] pgsteal_direct 123081
[ 68.851728] pgsteal_khugepaged 0

However, I still think this change is valuable. Like you mentioned,
the OOM log is not set in stone, but we shouldn't just change it for
no reason. In this case, for cgroup v1 users, the OOM log changed for
no reason beyond a side effect of another patch. Upon upgrading our
kernel we noticed the behavior change. This patch restores the old
behavior without any cost really, and it makes the code a tiny bit
more consistent.

> --
> Michal Hocko
> SUSE Labs

2023-04-28 09:48:20

by Michal Hocko

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Thu 27-04-23 15:12:46, Yosry Ahmed wrote:
[...]
> However, I still think this change is valuable. Like you mentioned,
> the OOM log is not set in stone, but we shouldn't just change it for
> no reason. In this case, for cgroup v1 users, the OOM log changed for
> no reason beyond a side effect of another patch. Upon upgrading our
> kernel we noticed the behavior change. This patch restores the old
> behavior without any cost really, and it makes the code a tiny bit
> more consistent.

Fair enough. Just make sure you go into more detail about why this is
causing problems/inconveniences. I am slightly worried this might cause
problems for other people who would like to have the same report for both
v1 and v2, so we should at least have some solid arguments to revert
rather than "it has changed and we liked it more that way".

I personally do not care all that much. It kinda sucks to dump counters
that are not tracked, or not fully tracked, in v1 because that can mislead
people and that would be a bigger problem from my POV.
--
Michal Hocko
SUSE Labs

2023-04-28 13:10:05

by Yosry Ahmed

Subject: Re: [PATCH 2/2] memcg: dump memory.stat during cgroup OOM for v1

On Fri, Apr 28, 2023 at 2:44 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 27-04-23 15:12:46, Yosry Ahmed wrote:
> [...]
> > However, I still think this change is valuable. Like you mentioned,
> > the OOM log is not set in stone, but we shouldn't just change it for
> > no reason. In this case, for cgroup v1 users, the OOM log changed for
> > no reason beyond a side effect of another patch. Upon upgrading our
> > kernel we noticed the behavior change. This patch restores the old
> > behavior without any cost really, and it makes the code a tiny bit
> > more consistent.
>
> Fair enough. Just make sure you go into more details about why this is
> causing problems/inconveniences. I am slightly worried this might cause
> problems to other people who would like to have the same report for both
> v1 and v2 so we should at least have some solid argumetns to revert
> rather than "it used has changed and we liked it more that way".
>
> I personally do not care all that much. It kinda sucks to dump counters
> that are not tracked or fully tracked in v1 because that can mislead
> people and that would be a bigger problem from my POV.

Great point, let me send a v2 rephrasing the commit log of this patch
and adding the Acks on the first one.

Thanks Michal!

> --
> Michal Hocko
> SUSE Labs