2017-09-14 17:15:45

by Yang Shi

Subject: [RFC] oom: capture unreclaimable slab info in oom message when kernel panic


Recently we ran into an OOM issue: the kernel panicked because there was no
killable process. The dmesg shows that huge unreclaimable slabs used almost
100% of memory, but kdump failed to capture a vmcore for some reason.

So it seems worthwhile to capture unreclaimable slab info in the OOM message
when the kernel panics, to aid troubleshooting and cover this corner case.
Since the kernel is already panicking, capturing more information is worth the
cost and does not affect the normal OOM killer path.

With the patchset, /proc/slabinfo prints an extra column for the reclaimable
flag, and tools/vm/slabinfo gains a new option, "-U", to show unreclaimable
slabs only.

In addition, the OOM path prints all unreclaimable slabs with non-zero usage
(num_objs * size != 0) in the OOM killer message.
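
The filter described above can be sketched in user space as follows. This is
only an illustration, not code from the patchset: struct cache_stat, to_kb()
and report_unreclaimable() are hypothetical names, and the struct stands in
for the per-cache data that the kernel would obtain from get_slabinfo().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for per-cache statistics. */
struct cache_stat {
	const char *name;
	unsigned long num_objs;  /* total objects in the cache */
	unsigned long obj_size;  /* bytes per object */
	int reclaimable;         /* nonzero if the cache counts as reclaimable */
};

/* Convert bytes to KB, truncating toward zero. */
static unsigned long to_kb(unsigned long bytes)
{
	return bytes / 1024;
}

/*
 * Print every unreclaimable cache whose usage (num_objs * obj_size) is
 * non-zero, and return how many lines were printed.
 */
static size_t report_unreclaimable(const struct cache_stat *c, size_t n,
				   FILE *out)
{
	size_t printed = 0;

	assert(c != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long bytes = c[i].num_objs * c[i].obj_size;

		if (c[i].reclaimable || bytes == 0)
			continue;
		fprintf(out, "%-17s %luKB\n", c[i].name, to_kb(bytes));
		printed++;
	}
	return printed;
}
```

Note that a cache smaller than 1024 bytes in total passes the filter but is
printed as 0KB, because the conversion truncates.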

For details, please see the commit log for each commit.

Yang Shi (3):
mm: slab: output reclaimable flag in /proc/slabinfo
tools: slabinfo: add "-U" option to show unreclaimable slabs only
mm: oom: show unreclaimable slab info when kernel panic

mm/oom_kill.c | 13 +++++++++++--
mm/slab.c | 1 +
mm/slab.h | 7 +++++++
mm/slab_common.c | 27 +++++++++++++++++++++++++++
mm/slub.c | 1 +
tools/vm/slabinfo.c | 11 ++++++++++-
6 files changed, 57 insertions(+), 3 deletions(-)


2017-09-14 17:15:21

by Yang Shi

Subject: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic

The kernel may panic when OOM happens without a killable process. Sometimes
this is caused by huge unreclaimable slabs used by the kernel.

Although kdump could help debug such a problem, kdump is not available on
all architectures and it might malfunction at times. Moreover, since the
kernel is already panicking, it is worth capturing such information in
dmesg to aid troubleshooting.

Print out info for unreclaimable slabs whose actual memory usage is not zero
(num_objs * size != 0) when panic_on_oom is set or there is no killable
process. Since this information is shown only when the kernel panics, it
will not make the message for a normal OOM too verbose.

The output looks like:

rpc_buffers       31KB
rpc_tasks         31KB
avtab_node        46735KB
xfs_buf           624KB
xfs_ili           48KB
xfs_efi_item      31KB
xfs_efd_item      31KB
xfs_buf_item      78KB
xfs_log_item_desc 141KB
xfs_trans         108KB
xfs_ifork         744KB
xfs_trans         108KB
xfs_ifork         744KB
xfs_da_state      126KB

Signed-off-by: Yang Shi <[email protected]>
---
mm/oom_kill.c | 13 +++++++++++--
mm/slab.h | 1 +
mm/slab_common.c | 25 +++++++++++++++++++++++++
3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 99736e0..173c423 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,7 @@

#include <asm/tlb.h>
#include "internal.h"
+#include "slab.h"

#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
@@ -427,6 +428,14 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
dump_tasks(oc->memcg, oc->nodemask);
}

+static void dump_header_with_slabinfo(struct oom_control *oc, struct task_struct *p)
+{
+ dump_header(oc, p);
+
+ if (IS_ENABLED(CONFIG_SLABINFO))
+ show_unreclaimable_slab();
+}
+
/*
* Number of OOM victims in flight
*/
@@ -959,7 +968,7 @@ static void check_panic_on_oom(struct oom_control *oc,
/* Do not panic for oom kills triggered by sysrq */
if (is_sysrq_oom(oc))
return;
- dump_header(oc, NULL);
+ dump_header_with_slabinfo(oc, NULL);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
}
@@ -1043,7 +1052,7 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
- dump_header(oc, NULL);
+ dump_header_with_slabinfo(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
if (oc->chosen && oc->chosen != (void *)-1UL) {
diff --git a/mm/slab.h b/mm/slab.h
index cf01a6e..2f1ebce 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -510,6 +510,7 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
void memcg_slab_stop(struct seq_file *m, void *p);
int memcg_slab_show(struct seq_file *m, void *p);
+void show_unreclaimable_slab(void);

void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8a55730..42cd32a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -35,6 +35,8 @@
static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
slab_caches_to_rcu_destroy_workfn);

+#define K(x) ((x)/1024)
+
/*
* Set of flags that will prevent slab merging
*/
@@ -1274,6 +1276,29 @@ static int slab_show(struct seq_file *m, void *p)
return 0;
}

+void show_unreclaimable_slab()
+{
+ struct kmem_cache *s = NULL;
+ struct slabinfo sinfo;
+
+ memset(&sinfo, 0, sizeof(sinfo));
+
+ printk("Unreclaimable slabs:\n");
+ mutex_lock(&slab_mutex);
+ list_for_each_entry(s, &slab_caches, list) {
+ if (!is_root_cache(s))
+ continue;
+
+ get_slabinfo(s, &sinfo);
+
+ if (!is_reclaimable(s) && sinfo.num_objs > 0)
+ printk("%-17s %luKB\n", cache_name(s), K(sinfo.num_objs * s->size));
+ }
+ mutex_unlock(&slab_mutex);
+}
+EXPORT_SYMBOL(show_unreclaimable_slab);
+#undef K
+
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
void *memcg_slab_start(struct seq_file *m, loff_t *pos)
{
--
1.8.3.1

2017-09-14 17:15:20

by Yang Shi

Subject: [PATCH 2/3] tools: slabinfo: add "-U" option to show unreclaimable slabs only

Add a "-U" option to show unreclaimable slabs only.

"-U" and "-S" together show which unreclaimable slabs use the most memory,
which helps debug issues with huge unreclaimable slabs.
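
The combined effect of the two options can be sketched as a filter followed by
a sort. This is a stand-alone illustration, not code from tools/vm/slabinfo.c:
struct slab_entry, by_size_desc() and top_unreclaimable() are hypothetical
names, and only the reclaim_account field name mirrors the tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified view of one slab cache. */
struct slab_entry {
	const char *name;
	unsigned long bytes;   /* total memory used by the cache */
	int reclaim_account;   /* nonzero for reclaimable caches */
};

/* qsort() comparator: largest cache first. */
static int by_size_desc(const void *a, const void *b)
{
	const struct slab_entry *x = a;
	const struct slab_entry *y = b;

	if (x->bytes < y->bytes)
		return 1;
	if (x->bytes > y->bytes)
		return -1;
	return 0;
}

/*
 * Keep only unreclaimable entries (the "-U" part), compacting them to the
 * front of the array, then sort them by size (the "-S" part). Returns the
 * number of entries kept; anything past that index is stale.
 */
static size_t top_unreclaimable(struct slab_entry *e, size_t n)
{
	size_t kept = 0;

	assert(e != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (e[i].reclaim_account)
			continue;
		e[kept++] = e[i];
	}
	if (kept > 1)
		qsort(e, kept, sizeof(*e), by_size_desc);
	return kept;
}
```

After the call, the first returned-count entries are the unreclaimable caches
in descending size order, so the largest consumer is at index 0.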

Signed-off-by: Yang Shi <[email protected]>
---
tools/vm/slabinfo.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index b9d34b3..9673190 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -83,6 +83,7 @@ struct aliasinfo {
int sort_loss;
int extended_totals;
int show_bytes;
+int unreclaim_only;

/* Debug options */
int sanity;
@@ -132,6 +133,7 @@ static void usage(void)
"-L|--Loss Sort by loss\n"
"-X|--Xtotals Show extended summary information\n"
"-B|--Bytes Show size in bytes\n"
+ "-U|--unreclaim Show unreclaimable slabs only\n"
"\nValid debug options (FZPUT may be combined)\n"
"a / A Switch on all debug options (=FZUP)\n"
"- Switch off all debug options\n"
@@ -568,6 +570,9 @@ static void slabcache(struct slabinfo *s)
if (strcmp(s->name, "*") == 0)
return;

+ if (unreclaim_only && s->reclaim_account)
+ return;
+
if (actual_slabs == 1) {
report(s);
return;
@@ -1346,6 +1351,7 @@ struct option opts[] = {
{ "Loss", no_argument, NULL, 'L'},
{ "Xtotals", no_argument, NULL, 'X'},
{ "Bytes", no_argument, NULL, 'B'},
+ { "unreclaim", no_argument, NULL, 'U'},
{ NULL, 0, NULL, 0 }
};

@@ -1357,7 +1363,7 @@ int main(int argc, char *argv[])

page_size = getpagesize();

- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXB",
+ while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXBU",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1438,6 +1444,9 @@ int main(int argc, char *argv[])
case 'B':
show_bytes = 1;
break;
+ case 'U':
+ unreclaim_only = 1;
+ break;
default:
fatal("%s: Invalid option '%c'\n", argv[0], optopt);

--
1.8.3.1

2017-09-14 17:15:46

by Yang Shi

Subject: [PATCH 1/3] mm: slab: output reclaimable flag in /proc/slabinfo

Although the slabinfo tool can print slab flags to show which slabs are
reclaimable, it would be nice to have the reclaimable flag shown in
/proc/slabinfo too, since /proc is still the first place to check for
slab info.

Add a new column called "reclaim" in /proc/slabinfo, "1" means
reclaimable, "0" means unreclaimable.

Signed-off-by: Yang Shi <[email protected]>
---
mm/slab.c | 1 +
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
mm/slub.c | 1 +
4 files changed, 10 insertions(+)

diff --git a/mm/slab.c b/mm/slab.c
index 04dec48..4f4971c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4132,6 +4132,7 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
sinfo->shared = cachep->shared;
sinfo->objects_per_slab = cachep->num;
sinfo->cache_order = cachep->gfporder;
+ sinfo->reclaim = is_reclaimable(cachep);
}

void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 0733628..cf01a6e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -186,6 +186,7 @@ struct slabinfo {
unsigned int shared;
unsigned int objects_per_slab;
unsigned int cache_order;
+ unsigned int reclaim;
};

void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
@@ -352,6 +353,11 @@ static inline void memcg_link_cache(struct kmem_cache *s)

#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

+static inline bool is_reclaimable(struct kmem_cache *s)
+{
+ return (s->flags & SLAB_RECLAIM_ACCOUNT) ? true : false;
+}
+
static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
struct kmem_cache *cachep;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 904a83b..8a55730 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1201,6 +1201,7 @@ static void print_slabinfo_header(struct seq_file *m)
seq_puts(m, " : globalstat <listallocs> <maxobjs> <grown> <reaped> <error> <maxfreeable> <nodeallocs> <remotefrees> <alienoverflow>");
seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
#endif
+ seq_puts(m, " : reclaim");
seq_putc(m, '\n');
}

@@ -1259,6 +1260,7 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
seq_printf(m, " : slabdata %6lu %6lu %6lu",
sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
slabinfo_show_stats(m, s);
+ seq_printf(m, " : %u", sinfo.reclaim);
seq_putc(m, '\n');
}

diff --git a/mm/slub.c b/mm/slub.c
index d39a5d3..c8526c0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5872,6 +5872,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
sinfo->num_slabs = nr_slabs;
sinfo->objects_per_slab = oo_objects(s->oo);
sinfo->cache_order = oo_order(s->oo);
+ sinfo->reclaim = is_reclaimable(s);
}

void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s)
--
1.8.3.1

Subject: Re: [PATCH 1/3] mm: slab: output reclaimable flag in /proc/slabinfo

Well /proc/slabinfo is a legacy interface. The information on whether a slab
is reclaimable is available via the slabinfo tool. We would break a format
that is relied upon by numerous tools.


Subject: Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic

I am not sure that this is generally useful at OOM times unless this is
not a rare occurrence.

Certainly information like that would create more support for making
objects movable.


2017-09-14 17:49:11

by Yang Shi

Subject: Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic



On 9/14/17 10:32 AM, Christopher Lameter wrote:
> I am not sure that this is generally useful at OOM times unless this is
> not a rare occurrence.

I would say it is not very rare. But, it is definitely troublesome to
narrow down without certain information about slab usage when it happens.

Thanks,
Yang

>
> Certainly information like that would create more support for making
> objects movable
>

2017-09-14 17:53:30

by Yang Shi

Subject: Re: [PATCH 1/3] mm: slab: output reclaimable flag in /proc/slabinfo



On 9/14/17 10:27 AM, Christopher Lameter wrote:
> Well /proc/slabinfo is a legacy interface. The infomation if a slab is
> reclaimable is available via the slabinfo tool. We would break a format
> that is relied upon by numerous tools.

Thanks for pointing this out. It would be unacceptable if it broke
backward compatibility.

A follow-up question: do we know which tools rely on the slabinfo format?

From my point of view, although /proc/slabinfo is legacy, it seems to
still be used very often by users.

Thanks,
Yang

>

2017-09-15 12:00:50

by Tetsuo Handa

Subject: Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic

On 2017/09/15 2:14, Yang Shi wrote:
> @@ -1274,6 +1276,29 @@ static int slab_show(struct seq_file *m, void *p)
> return 0;
> }
>
> +void show_unreclaimable_slab()
> +{
> + struct kmem_cache *s = NULL;
> + struct slabinfo sinfo;
> +
> + memset(&sinfo, 0, sizeof(sinfo));
> +
> + printk("Unreclaimable slabs:\n");
> + mutex_lock(&slab_mutex);

Please avoid sleeping locks which potentially depend on memory allocation.
There are

mutex_lock(&slab_mutex);
kmalloc(GFP_KERNEL);
mutex_unlock(&slab_mutex);

users which will fail to call panic() if they hit this path.
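
One common way to avoid blocking in a situation like this is to try the lock
and skip the dump if it is already held. The sketch below is a user-space
analogy built on POSIX threads, not what the posted patch does;
cache_list_lock and dump_stats_nonblocking() are hypothetical names. In the
kernel the corresponding primitive would be mutex_trylock().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical lock protecting the list of caches. */
static pthread_mutex_t cache_list_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Dump statistics only if the lock can be taken without waiting.
 * Returns true if the dump ran, false if the lock was busy.
 */
static bool dump_stats_nonblocking(FILE *out)
{
	assert(out != NULL);

	if (pthread_mutex_trylock(&cache_list_lock) != 0) {
		/* The holder may itself be stuck waiting for memory. */
		fprintf(out, "cache list busy, skipping slab dump\n");
		return false;
	}

	fprintf(out, "Unreclaimable slabs:\n");
	/* ... walk the cache list and print each entry here ... */

	pthread_mutex_unlock(&cache_list_lock);
	return true;
}
```

The trade-off is that the dump may be skipped exactly when the lock holder is
the task that ran out of memory, but the caller can always proceed to its next
step instead of sleeping.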

> + list_for_each_entry(s, &slab_caches, list) {
> + if (!is_root_cache(s))
> + continue;
> +
> + get_slabinfo(s, &sinfo);
> +
> + if (!is_reclaimable(s) && sinfo.num_objs > 0)
> + printk("%-17s %luKB\n", cache_name(s), K(sinfo.num_objs * s->size));
> + }
> + mutex_unlock(&slab_mutex);
> +}
> +EXPORT_SYMBOL(show_unreclaimable_slab);
> +#undef K
> +
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> void *memcg_slab_start(struct seq_file *m, loff_t *pos)
> {
>

2017-09-15 17:40:33

by Yang Shi

Subject: Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic



On 9/15/17 5:00 AM, Tetsuo Handa wrote:
> On 2017/09/15 2:14, Yang Shi wrote:
>> @@ -1274,6 +1276,29 @@ static int slab_show(struct seq_file *m, void *p)
>> return 0;
>> }
>>
>> +void show_unreclaimable_slab()
>> +{
>> + struct kmem_cache *s = NULL;
>> + struct slabinfo sinfo;
>> +
>> + memset(&sinfo, 0, sizeof(sinfo));
>> +
>> + printk("Unreclaimable slabs:\n");
>> + mutex_lock(&slab_mutex);
>
> Please avoid sleeping locks which potentially depend on memory allocation.
> There are
>
> mutex_lock(&slab_mutex);
> kmalloc(GFP_KERNEL);
> mutex_unlock(&slab_mutex);
>
> users which will fail to call panic() if they hit this path

Thanks for the heads up. Since this is only called from the OOM panic path,
it seems safe to drop the mutex_lock()/mutex_unlock() calls, since nobody
can allocate memory without GFP_ATOMIC and thereby change the slab
statistics.

Even if some GFP_ATOMIC callers allocate memory successfully, that should
have no significant impact on the slabinfo we need to capture, since
GFP_ATOMIC allocations are typically small.

I will drop the mutex in v2 if no one objects.

Thanks,
Yang

>
>> + list_for_each_entry(s, &slab_caches, list) {
>> + if (!is_root_cache(s))
>> + continue;
>> +
>> + get_slabinfo(s, &sinfo);
>> +
>> + if (!is_reclaimable(s) && sinfo.num_objs > 0)
>> + printk("%-17s %luKB\n", cache_name(s), K(sinfo.num_objs * s->size));
>> + }
>> + mutex_unlock(&slab_mutex);
>> +}
>> +EXPORT_SYMBOL(show_unreclaimable_slab);
>> +#undef K
>> +
>> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>> void *memcg_slab_start(struct seq_file *m, loff_t *pos)
>> {
>>