2012-02-22 11:55:09

by Rafael Aquini

Subject: [PATCH] oom: add sysctl to enable slab memory dump

Adds a new sysctl, 'oom_dump_slabs', that enables the kernel to produce a
dump of all eligible system slab caches when performing an OOM kill.
The information includes per-cache active objects, total objects, object
size, cache name, and cache size.

Eligibility for being reported is determined by an auxiliary sysctl,
'oom_dump_slabs_ratio', which expresses (as a percentage) the ratio
between a particular cache's size and the total slab size.

This, alongside all the other data dumped in OOM events, is very helpful
information in diagnosing why there was an OOM condition, especially when
kernel code is under investigation.

Signed-off-by: Rafael Aquini <[email protected]>
---
Documentation/sysctl/vm.txt | 30 ++++++++++++++++++
include/linux/oom.h | 2 +
kernel/sysctl.c | 16 +++++++++
mm/oom_kill.c | 13 ++++++++
mm/slab.c | 72 +++++++++++++++++++++++++++++++++++++++++++
mm/slub.c | 52 +++++++++++++++++++++++++++++++
6 files changed, 185 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..a0da8b3 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -498,6 +498,36 @@ this is causing problems for your system/application.

==============================================================

+oom_dump_slabs
+
+Enables a system-wide slab cache dump to be produced when the kernel
+performs an OOM kill. For each slab cache, it reports the active objects,
+total objects, object size, cache name, and cache size. This is helpful
+to determine the top slab cache memory users, as well as to identify
+their contribution to the OOM occurrence.
+
+If this is set to zero, this information is suppressed.
+
+If this is set to non-zero, this information is shown whenever the
+OOM killer actually kills a memory-hogging task.
+
+The default value is 1 (enabled).
+
+==============================================================
+
+oom_dump_slabs_ratio
+
+Adjusts, as a percentage of the total system memory dedicated to the slab
+cache, a per-cache size cutoff for oom_dump_slabs reports. If this is set
+to a non-zero value N, only caches bigger than N% of the total memory
+committed to slab will be dumped when the OOM killer is invoked.
+
+When set to 0, all slab caches will be listed in the dump report.
+
+The default value is 10.
+
+==============================================================
+
oom_dump_tasks

Enables a system-wide task dump (excluding kernel threads) to be
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 552fba9..8af3863 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -69,6 +69,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);

/* sysctls */
extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_dump_slabs;
+extern int sysctl_oom_dump_slabs_ratio;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
#endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f487f25..a9da3d8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1040,6 +1040,22 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB)
+ {
+ .procname = "oom_dump_slabs",
+ .data = &sysctl_oom_dump_slabs,
+ .maxlen = sizeof(sysctl_oom_dump_slabs),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "oom_dump_slabs_ratio",
+ .data = &sysctl_oom_dump_slabs_ratio,
+ .maxlen = sizeof(sysctl_oom_dump_slabs_ratio),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{
.procname = "overcommit_ratio",
.data = &sysctl_overcommit_ratio,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2958fd8..4de948d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,9 +38,20 @@
#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>

+#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB)
+extern void oom_dump_slabs(int ratio);
+#else
+static void oom_dump_slabs(int ratio)
+{
+}
+#endif
+
+
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
+int sysctl_oom_dump_slabs = 1;
+int sysctl_oom_dump_slabs_ratio = 10;
static DEFINE_SPINLOCK(zone_scan_lock);

/*
@@ -429,6 +440,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_stack();
mem_cgroup_print_oom_info(memcg, p);
show_mem(SHOW_MEM_FILTER_NODES);
+ if (sysctl_oom_dump_slabs)
+ oom_dump_slabs(sysctl_oom_dump_slabs_ratio);
if (sysctl_oom_dump_tasks)
dump_tasks(memcg, nodemask);
}
diff --git a/mm/slab.c b/mm/slab.c
index f0bd785..c2b5d14 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4629,3 +4629,75 @@ size_t ksize(const void *objp)
return obj_size(virt_to_cache(objp));
}
EXPORT_SYMBOL(ksize);
+
+/**
+ * oom_dump_slabs - dump top slab cache users
+ * @ratio: memory committed ratio between a cache size and the total slab size
+ *
+ * Dumps the current memory state of all eligible slab caches.
+ * State information includes cache's active objects, total objects,
+ * object size, cache name, and cache size.
+ */
+void oom_dump_slabs(int ratio)
+{
+ struct kmem_cache *cachep;
+ struct kmem_list3 *l3;
+ struct slab *slabp;
+ unsigned long active_objs, num_objs, free_objects, cache_size;
+ unsigned long active_slabs, num_slabs, slab_total_mem;
+ int node;
+
+ slab_total_mem = (global_page_state(NR_SLAB_RECLAIMABLE) +
+ global_page_state(NR_SLAB_UNRECLAIMABLE)) << PAGE_SHIFT;
+
+ if (ratio < 0)
+ ratio = 0;
+
+ if (ratio > 100)
+ ratio = 100;
+
+ pr_info("--- oom_dump_slabs:\n");
+ pr_info("<active_objs> <num_objs> <objsize> <cache_name>\n");
+ mutex_lock(&cache_chain_mutex);
+ list_for_each_entry(cachep, &cache_chain, next) {
+ num_objs = 0;
+ num_slabs = 0;
+ active_objs = 0;
+ free_objects = 0;
+ active_slabs = 0;
+
+ for_each_online_node(node) {
+ l3 = cachep->nodelists[node];
+ if (!l3)
+ continue;
+
+ check_irq_on();
+ spin_lock_irq(&l3->list_lock);
+
+ list_for_each_entry(slabp, &l3->slabs_full, list) {
+ active_objs += cachep->num;
+ active_slabs++;
+ }
+ list_for_each_entry(slabp, &l3->slabs_partial, list) {
+ active_objs += slabp->inuse;
+ active_slabs++;
+ }
+ list_for_each_entry(slabp, &l3->slabs_free, list)
+ num_slabs++;
+
+ free_objects += l3->free_objects;
+ spin_unlock_irq(&l3->list_lock);
+ }
+ num_slabs += active_slabs;
+ num_objs = num_slabs * cachep->num;
+ cache_size = (cachep->buffer_size * num_objs);
+
+ if (cache_size >= (slab_total_mem * ratio / 100))
+ pr_info("%12lu %12lu %12u %-20s : %9lu kB\n",
+ active_objs, num_objs, cachep->buffer_size,
+ cachep->name, cache_size >> 10);
+ }
+ mutex_unlock(&cache_chain_mutex);
+ pr_info("---\n");
+}
+EXPORT_SYMBOL(oom_dump_slabs);
diff --git a/mm/slub.c b/mm/slub.c
index 4907563..7719f92 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5480,3 +5480,55 @@ static int __init slab_proc_init(void)
}
module_init(slab_proc_init);
#endif /* CONFIG_SLABINFO */
+
+/**
+ * oom_dump_slabs - dump top slab cache users
+ * @ratio: memory committed ratio between a cache size and the total slab size
+ *
+ * Dumps the current memory state of all eligible slab caches.
+ * State information includes cache's active objects, total objects,
+ * object size, cache name, and cache size.
+ */
+void oom_dump_slabs(int ratio)
+{
+ unsigned long cache_size, slab_total_mem;
+ unsigned long nr_objs, nr_free, nr_inuse;
+ struct kmem_cache *cachep;
+ int node;
+
+ slab_total_mem = (global_page_state(NR_SLAB_RECLAIMABLE) +
+ global_page_state(NR_SLAB_UNRECLAIMABLE)) << PAGE_SHIFT;
+
+ if (ratio < 0)
+ ratio = 0;
+
+ if (ratio > 100)
+ ratio = 100;
+
+ pr_info("--- oom_dump_slabs:\n");
+ pr_info("<active_objs> <num_objs> <objsize> <cache_name>\n");
+ down_read(&slub_lock);
+ list_for_each_entry(cachep, &slab_caches, list) {
+ nr_objs = 0;
+ nr_free = 0;
+
+ for_each_online_node(node) {
+ struct kmem_cache_node *n = get_node(cachep, node);
+ if (!n)
+ continue;
+
+ nr_objs += atomic_long_read(&n->total_objects);
+ nr_free += count_partial(n, count_free);
+ }
+ nr_inuse = nr_objs - nr_free;
+ cache_size = (cachep->size * nr_objs);
+
+ if (cache_size >= (slab_total_mem * ratio / 100))
+ pr_info("%12lu %12lu %12u %-20s : %9lu kB\n",
+ nr_inuse, nr_objs, cachep->size,
+ cachep->name, cache_size >> 10);
+ }
+ up_read(&slub_lock);
+ pr_info("---\n");
+}
+EXPORT_SYMBOL(oom_dump_slabs);
--
1.7.7.6


2012-02-22 13:17:40

by Pekka Enberg

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, Feb 22, 2012 at 1:53 PM, Rafael Aquini <[email protected]> wrote:
> Adds a new sysctl, 'oom_dump_slabs', that enables the kernel to produce a
> dump of all eligible system slab caches when performing an OOM kill.
> The information includes per-cache active objects, total objects, object
> size, cache name, and cache size.
>
> Eligibility for being reported is determined by an auxiliary sysctl,
> 'oom_dump_slabs_ratio', which expresses (as a percentage) the ratio
> between a particular cache's size and the total slab size.
>
> This, alongside all the other data dumped in OOM events, is very helpful
> information in diagnosing why there was an OOM condition, especially when
> kernel code is under investigation.
>
> Signed-off-by: Rafael Aquini <[email protected]>

Makes sense. Do you have an example of what an out-of-memory slab cache
dump looks like?

> diff --git a/mm/slab.c b/mm/slab.c
> index f0bd785..c2b5d14 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -4629,3 +4629,75 @@ size_t ksize(const void *objp)
>        return obj_size(virt_to_cache(objp));
>  }
>  EXPORT_SYMBOL(ksize);
> +
> +/**
> + * oom_dump_slabs - dump top slab cache users
> + * @ratio: memory committed ratio between a cache size and the total slab size
> + *
> + * Dumps the current memory state of all eligible slab caches.
> + * State information includes cache's active objects, total objects,
> + * object size, cache name, and cache size.
> + */
> +void oom_dump_slabs(int ratio)
> +{
> +       struct kmem_cache *cachep;
> +       struct kmem_list3 *l3;
> +       struct slab *slabp;
> +       unsigned long active_objs, num_objs, free_objects, cache_size;
> +       unsigned long active_slabs, num_slabs, slab_total_mem;
> +       int node;

[snip]

> +       list_for_each_entry(cachep, &cache_chain, next) {
> +               num_objs = 0;
> +               num_slabs = 0;
> +               active_objs = 0;
> +               free_objects = 0;
> +               active_slabs = 0;

Minor style nit: just define the zeroed variables in this block.

> +
> +               for_each_online_node(node) {
> +                       l3 = cachep->nodelists[node];
> +                       if (!l3)
> +                               continue;

[snip]
> +void oom_dump_slabs(int ratio)
> +{
> +       unsigned long cache_size, slab_total_mem;
> +       unsigned long nr_objs, nr_free, nr_inuse;
> +       struct kmem_cache *cachep;
> +       int node;
> +
> +       slab_total_mem = (global_page_state(NR_SLAB_RECLAIMABLE) +
> +                       global_page_state(NR_SLAB_UNRECLAIMABLE)) << PAGE_SHIFT;
> +
> +       if (ratio < 0)
> +               ratio = 0;
> +
> +       if (ratio > 100)
> +               ratio = 100;
> +
> +       pr_info("--- oom_dump_slabs:\n");
> +       pr_info("<active_objs> <num_objs> <objsize> <cache_name>\n");
> +       down_read(&slub_lock);
> +       list_for_each_entry(cachep, &slab_caches, list) {
> +               nr_objs = 0;
> +               nr_free = 0;

ditto.

> +
> +               for_each_online_node(node) {
> +                       struct kmem_cache_node *n = get_node(cachep, node);
> +                       if (!n)
> +                               continue;
> +

by Christoph Lameter

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, 22 Feb 2012, Rafael Aquini wrote:

> --- a/mm/slub.c
> +++ b/mm/slub.c
> +void oom_dump_slabs(int ratio)
> +{

> +
> + for_each_online_node(node) {
> + struct kmem_cache_node *n = get_node(cachep, node);
> + if (!n)
> + continue;
> +
> + nr_objs += atomic_long_read(&n->total_objects);

Please use node_nr_objects() instead of directly accessing total_objects.
total_objects are only available if debugging support was compiled in.

2012-02-22 16:06:14

by Rafael Aquini

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, Feb 22, 2012 at 03:17:23PM +0200, Pekka Enberg wrote:
> On Wed, Feb 22, 2012 at 1:53 PM, Rafael Aquini <[email protected]> wrote:
> > This, alongside all the other data dumped in OOM events, is very helpful
> > information in diagnosing why there was an OOM condition, especially when
> > kernel code is under investigation.
> >
> > Signed-off-by: Rafael Aquini <[email protected]>
>
> Makes sense. Do you have an example of what an out-of-memory slab cache
> dump looks like?
>
Yes, I have a couple of dumps taken from some dabbling tests I was
performing. Would it be worth including a sample in the commit message, or
among the Documentation paragraphs?


> Minor style nit: just define the zeroed variables in this block.
>
I will adjust those.

Thanks for your feedback!
--
Rafael Aquini <[email protected]>
Software Maintenance Engineer
Red Hat, Inc.
+55 51 4063.9436 / 8426138 (ext)

2012-02-22 16:16:24

by Rafael Aquini

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, Feb 22, 2012 at 07:55:16AM -0600, Christoph Lameter wrote:
>
> Please use node_nr_objects() instead of directly accessing total_objects.
> total_objects are only available if debugging support was compiled in.
>
Shame on me! I wrongly assumed that it would be safe to access
the element because SLUB_DEBUG is turned on by default when slub is chosen.

Considering your note on my previous mistake, shall I assume now that it
would be better to have this whole dump feature depend on CONFIG_SLUB_DEBUG
instead of just CONFIG_SLUB?

Thanks for your feedback!
--
Rafael Aquini <[email protected]>
Software Maintenance Engineer
Red Hat, Inc.
+55 51 4063.9436 / 8426138 (ext)

by Christoph Lameter

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, 22 Feb 2012, Rafael Aquini wrote:

> On Wed, Feb 22, 2012 at 07:55:16AM -0600, Christoph Lameter wrote:
> >
> > Please use node_nr_objects() instead of directly accessing total_objects.
> > total_objects are only available if debugging support was compiled in.
> >
> Shame on me! I wrongly assumed that it would be safe to access
> the element because SLUB_DEBUG is turned on by default when slub is chosen.
>
> Considering your note on my previous mistake, shall I assume now that it
> would be better to have this whole dump feature depend on CONFIG_SLUB_DEBUG
> instead of just CONFIG_SLUB?

That is certainly one solution. If CONFIG_SLUB_DEBUG is not set, then
support for maintaining a total count is not compiled in. You can of
course still approximate it from the total number of slabs allocated,
multiplying that by the number of objects per slab page.

2012-02-23 00:45:04

by David Rientjes

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, 22 Feb 2012, Rafael Aquini wrote:

> Adds a new sysctl, 'oom_dump_slabs', that enables the kernel to produce a
> dump of all eligible system slab caches when performing an OOM kill.
> The information includes per-cache active objects, total objects, object
> size, cache name, and cache size.
>
> Eligibility for being reported is determined by an auxiliary sysctl,
> 'oom_dump_slabs_ratio', which expresses (as a percentage) the ratio
> between a particular cache's size and the total slab size.
>
> This, alongside all the other data dumped in OOM events, is very helpful
> information in diagnosing why there was an OOM condition, especially when
> kernel code is under investigation.
>

I don't like this because it duplicates what is given by /proc/slabinfo
that can easily be read at the time of oom and is unnecessary to dump to
the kernel log. We display the meminfo (which includes the amount of
slab, just not broken down by cache) because it's absolutely necessary to
understand why the oom was triggered. The tasklist dump is allowed
because it's difficult to attain all that information easily and to
determine which threads are eligible in the oom context (global, memcg,
cpuset, mempolicy) so they matter to the oom condition. The per-cache
slabinfo fits neither of those criteria and just duplicates code in the
slab allocators that is attainable elsewhere.

I think this also gives another usecase for a possible /dev/mem_notify in
the future: userspace could easily poll on an eventfd and wait for an oom
to occur and then cat /proc/slabinfo to attain all this. In other words,
if we had this functionality (which I think we undoubtedly will in the
future), this patch would be obsoleted.

2012-02-23 15:02:49

by Josef Bacik

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, Feb 22, 2012 at 04:44:58PM -0800, David Rientjes wrote:
> On Wed, 22 Feb 2012, Rafael Aquini wrote:
>
> > Adds a new sysctl, 'oom_dump_slabs', that enables the kernel to produce a
> > dump of all eligible system slab caches when performing an OOM kill.
> > The information includes per-cache active objects, total objects, object
> > size, cache name, and cache size.
> >
> > Eligibility for being reported is determined by an auxiliary sysctl,
> > 'oom_dump_slabs_ratio', which expresses (as a percentage) the ratio
> > between a particular cache's size and the total slab size.
> >
> > This, alongside all the other data dumped in OOM events, is very helpful
> > information in diagnosing why there was an OOM condition, especially when
> > kernel code is under investigation.
> >
>
> I don't like this because it duplicates what is given by /proc/slabinfo
> that can easily be read at the time of oom and is unnecessary to dump to
> the kernel log. We display the meminfo (which includes the amount of
> slab, just not broken down by cache) because it's absolutely necessary to
> understand why the oom was triggered. The tasklist dump is allowed
> because it's difficult to attain all that information easily and to
> determine which threads are eligible in the oom context (global, memcg,
> cpuset, mempolicy) so they matter to the oom condition. The per-cache
> > slabinfo fits neither of those criteria and just duplicates code in the
> slab allocators that is attainable elsewhere.
>

I requested this specifically because I was oom'ing the box so hard that I
couldn't read /proc/slabinfo at the time of OOM and therefore had no idea what I
was leaking. Telling me how much slab was in use was no help, I needed to know
which of the like 6 objects I was doing horrible things with was screwing me,
and without this patch I would have no way of knowing.

> I think this also gives another usecase for a possible /dev/mem_notify in
> the future: userspace could easily poll on an eventfd and wait for an oom
> to occur and then cat /proc/slabinfo to attain all this. In other words,
> if we had this functionality (which I think we undoubtedly will in the
> future), this patch would be obsoleted.

Sure, if the OOM killer doesn't kill the poller, or kill NetworkManager since
I'm remote logged into the box, or any of the other various important things
that would be required for me to get this info. Thanks,

Josef

2012-02-23 15:24:09

by Rafael Aquini

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Wed, Feb 22, 2012 at 04:44:58PM -0800, David Rientjes wrote:
> I don't like this because it duplicates what is given by /proc/slabinfo
> that can easily be read at the time of oom and is unnecessary to dump to
> the kernel log. We display the meminfo (which includes the amount of
> slab, just not broken down by cache) because it's absolutely necessary to
> understand why the oom was triggered. The tasklist dump is allowed
> because it's difficult to attain all that information easily and to
> determine which threads are eligible in the oom context (global, memcg,
> cpuset, mempolicy) so they matter to the oom condition. The per-cache
> slabinfo fits neither of those criteria and just duplicates code in the
> slab allocators that is attainable elsewhere.

Let's say the slab gets so bloated that the OOM killer instantly kills every
user task spawned, or the system falls under severe thrashing, leaving no
chance of getting an interactive session to parse /proc/slabinfo, thus
making the reset button your only escape... How would you identify the
set of caches responsible for the slab swelling?

IMHO, having such qualified info about slab usage at hand is very useful in
several OOM occurrences. It not only helps developers, but also sysadmins
troubleshooting slab usage when the OOM killer is invoked, so qualifying and
showing such data surely makes sense for a lot of people. For those who do
not mind/care about such reporting, in the end it just takes a sysctl knob
adjustment to make it go quiet.

>
> I think this also gives another usecase for a possible /dev/mem_notify in
> the future: userspace could easily poll on an eventfd and wait for an oom
> to occur and then cat /proc/slabinfo to attain all this. In other words,
> if we had this functionality (which I think we undoubtedly will in the
> future), this patch would be obsoleted.

Great! So, why not let time tell us whether this feature will be obsoleted
or not? I'd rather have this patch obsoleted by a better one than
just stand still waiting for something that might, or might not, happen in
the future.



Thanks for your feedback!
--
Rafael Aquini <[email protected]>
Software Maintenance Engineer
Red Hat, Inc.
+55 51 4063.9436 / 8426138 (ext)

2012-02-23 16:04:00

by Pekka Enberg

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Thu, 2012-02-23 at 13:22 -0200, Rafael Aquini wrote:
> > I think this also gives another usecase for a possible /dev/mem_notify in
> > the future: userspace could easily poll on an eventfd and wait for an oom
> > to occur and then cat /proc/slabinfo to attain all this. In other words,
> > if we had this functionality (which I think we undoubtedly will in the
> > future), this patch would be obsoleted.
>
> Great! So, why not let time tell us whether this feature will be obsoleted
> or not? I'd rather have this patch obsoleted by a better one than
> just stand still waiting for something that might, or might not, happen in
> the future.

Sure.

I'm not really convinced such an ABI would be a full replacement for
this patch. There are certainly advantages to having all this visible in
syslog even if we had such a mechanism.

Pekka

2012-02-23 23:09:40

by David Rientjes

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Thu, 23 Feb 2012, Josef Bacik wrote:

> I requested this specifically because I was oom'ing the box so hard that I
> couldn't read /proc/slabinfo at the time of OOM and therefore had no idea what I
> was leaking. Telling me how much slab was in use was no help, I needed to know
> which of the like 6 objects I was doing horrible things with was screwing me,
> and without this patch I would have no way of knowing.
>

So an oom was creating a denial of service so that you had no way to do
cat /proc/slabinfo? I think we should talk about this first, because
that's a serious situation that certainly shouldn't be happening.

The oom killer is designed to kill the most memory-hogging task available
so that it doesn't have to kill multiple threads. Why was the memory not
being freed or why was the thread that was consistently being killed
restarted time and time again so you couldn't even cat a file?

> Sure, if the OOM killer doesn't kill the poller, or kill NetworkManager since
> I'm remote logged into the box, or any of the other various important things
> that would be required for me to get this info. Thanks,
>

If you're polling for oom notifications sanely, you'd probably have set

echo -1000 > /proc/pid/oom_score_adj

so it's unkillable as well as anything else you need to diagnose failures.
NetworkManager itself isn't protected like this by default, but it
shouldn't be killed unless it is leaking memory itself: we kill in the
order of the most memory usage to the least.

So neither of these are reasons not to collect /proc/slabinfo, but I'm
very interested in your follow-up on why you can't do so when "ooming the
box so hard", where you're presumably able to get output into the kernel
log but not cat /proc/slabinfo :)

2012-02-23 23:17:40

by David Rientjes

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Thu, 23 Feb 2012, Rafael Aquini wrote:

> Let's say the slab gets so bloated that the OOM killer instantly kills every
> user task spawned, or the system falls under severe thrashing, leaving no
> chance of getting an interactive session to parse /proc/slabinfo, thus
> making the reset button your only escape... How would you identify the
> set of caches responsible for the slab swelling?
>

I think you misunderstand completely how the oom killer works,
unfortunately. It, by default unless you have changed oom_score_adj
tunables, kills the most memory-hogging eligible thread possible. That
certainly wouldn't be a freshly forked user task prior to execve() unless
you've enabled /proc/sys/vm/oom_kill_allocating_task, which you shouldn't
unless you're running on a machine with 1k cores, for example. It would
be an existing thread that was using a lot of memory to allow for things
EXACTLY LIKE forking additional user tasks. We don't want to get into a
self-imposed DoS because something is oom and the oom killer does quite a
good job at ensuring it doesn't. The goal is to kill a single thread to
free the most amount of memory possible.

If this is what is affecting you, then you'll need to figure out why you
have changed the oom killer priority in such a way: check the
/proc/pid/oom_score_adj values you have set so that, when they are
inherited, they will instantly kill the child because it will quickly
use more memory than the parent.

> IMHO, having such qualified info about slab usage at hand is very useful in
> several occurrences of OOM. It not only helps out developers, but also sysadmins
> on troubleshooting slab usage when OOM-killer is invoked, thus qualifying and
> showing such data surely does make sense for a lot of people. For those who do
> not mind/care about such reporting, in the end it just takes a sysctl knob
> adjustment to make it go quiet.
>

cat /proc/slabinfo

> > I think this also gives another usecase for a possible /dev/mem_notify in
> > the future: userspace could easily poll on an eventfd and wait for an oom
> > to occur and then cat /proc/slabinfo to attain all this. In other words,
> > if we had this functionality (which I think we undoubtedly will in the
> > future), this patch would be obsoleted.
>
> Great! So, why not let time tell us whether this feature will be obsoleted
> or not? I'd rather have this patch obsoleted by a better one than
> just stand still waiting for something that might, or might not, happen in
> the future.
>

Because (1) you're adding a sysctl that we don't want to obsolete and
remove from the kernel once someone has come to depend on it, forcing them
to find an alternative solution like /dev/mem_notify, and (2) people parse
messages like this that are emitted to the kernel log, and we don't want
to break them in the future.

So NACK on this approach.

2012-02-24 06:57:43

by Pekka Enberg

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Thu, 23 Feb 2012, David Rientjes wrote:
> > Great! So, why not let time tell us whether this feature will be obsoleted
> > or not? I'd rather have this patch obsoleted by a better one than
> > just stand still waiting for something that might, or might not, happen in
> > the future.
>
> Because (1) you're adding a sysctl that we don't want to obsolete and
> remove from the kernel that someone will come to depend on and then have
> to find an alternative solution like /dev/mem_notify, and (2) people parse
> messages like this that are emitted to the kernel log that we don't want
> to break in the future.
>
> So NACK on this approach.

Right. We should drop the sysctl and make it into a kernel command line
debugging option instead.

Pekka

2012-02-24 10:03:53

by David Rientjes

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, 24 Feb 2012, Pekka Enberg wrote:

> Right. We should drop the sysctl and make it into a kernel command line
> debugging option instead.
>

I like how slub handles this when it can't allocate more slab with
slab_out_of_memory() and has the added benefit of still warning even with
__GFP_NORETRY that the oom killer is never called for. If there's really
a slab leak happening, there's a good chance that this diagnostic
information is going to be emitted by the offending cache at some point in
time if you're using slub. This could easily be extended to slab.c, so
it's even more reason not to include this type of information in the oom
killer.

2012-02-24 10:05:30

by Pekka Enberg

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, Feb 24, 2012 at 12:03 PM, David Rientjes <[email protected]> wrote:
> I like how slub handles this when it can't allocate more slab with
> slab_out_of_memory() and has the added benefit of still warning even with
> __GFP_NORETRY that the oom killer is never called for. If there's really
> a slab leak happening, there's a good chance that this diagnostic
> information is going to be emitted by the offending cache at some point in
> time if you're using slub. This could easily be extended to slab.c, so
> it's even more reason not to include this type of information in the oom
> killer.

Works for me. Rafael?

2012-02-24 10:40:17

by Rafael Aquini

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, Feb 24, 2012 at 12:05:27PM +0200, Pekka Enberg wrote:
> On Fri, Feb 24, 2012 at 12:03 PM, David Rientjes <[email protected]> wrote:
> > I like how slub handles this when it can't allocate more slab with
> > slab_out_of_memory() and has the added benefit of still warning even with
> > __GFP_NORETRY that the oom killer is never called for. If there's really
> > a slab leak happening, there's a good chance that this diagnostic
> > information is going to be emitted by the offending cache at some point in
> > time if you're using slub. This could easily be extended to slab.c, so
> > it's even more reason not to include this type of information in the oom
> > killer.
>
> Works for me. Rafael?

Sure, I'm getting back to the scratchpad right away.

2012-02-24 15:10:35

by Josef Bacik

Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Thu, Feb 23, 2012 at 03:09:36PM -0800, David Rientjes wrote:
> On Thu, 23 Feb 2012, Josef Bacik wrote:
>
> > I requested this specifically because I was oom'ing the box so hard that I
> > couldn't read /proc/slabinfo at the time of OOM and therefore had no idea what I
> > was leaking. Telling me how much slab was in use was no help, I needed to know
> > which of the like 6 objects I was doing horrible things with was screwing me,
> > and without this patch I would have no way of knowing.
> >
>
> So an oom was creating a denial of service so that you had no way to do
> cat /proc/slabinfo? I think we should talk about this first, because
> that's a serious situation that certainly shouldn't be happening.
>

Um, well yeah, I'm rewriting a chunk of btrfs which was rampantly leaking
memory, so the OOM killer just couldn't keep up with how much I was sucking
down. This is strictly a "developer is doing something stupid and needs help
pointing out what it is" sort of moment, not a day-to-day OOM.

> The oom killer is designed to kill the most memory-hogging task available
> so that it doesn't have to kill multiple threads. Why was the memory not
> being freed or why was the thread that was consistently being killed
> restarted time and time again so you couldn't even cat a file?
>

Memory was not being freed because it was all tied up in the metadata cache
stuff that I'm working on.

> > Sure, if the OOM killer doesn't kill the poller, or kill NetworkManager since
> > I'm remote logged into the box, or any of the other various important things
> > that would be required for me to get this info. Thanks,
> >
>
> If you're polling for oom notifications sanely, you'd probably have set
>
> echo -1000 > /proc/pid/oom_score_adj
>
> so it's unkillable as well as anything else you need to diagnose failures.
> NetworkManager itself isn't protected like this by default, but it
> shouldn't be killed unless it is leaking memory itself: we kill in the
> order of the most memory usage to the least.
>
> So neither of these are reasons to not collect /proc/slabinfo, but I'm
> very interested in your follow-up to why you can't do so when "ooming the
> box so hard" where you're presumably able to cat to kernel log file but
> not cat /proc/slabinfo :)

I'm not able to cat the kernel log file; I was using netconsole, so I'd see the
OOM and then nothing: I couldn't do anything on the screen session I had, and IIRC
netconsole just stopped; it required a hard reboot.

I think you are missing the point here. This is likely not useful in normal
OOM situations; this is for my very narrow need of doing something sooo
incredibly stupid that it makes the box unusable, but I still need to get some
sort of info out so I can know _which_ stupid thing I'm doing. Thanks,

Josef
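
[Editor's note: the oom_score_adj protection David describes above can be
sketched from user space. This is an illustrative sketch, not from the thread;
the pid handling and helper names are hypothetical, and writing the file
requires appropriate privileges.]

```python
# Sketch: protect a diagnostic/polling process from the OOM killer by
# writing the minimum badness adjustment to /proc/<pid>/oom_score_adj.
# The valid kernel range is [-1000, 1000]; -1000 makes the task unkillable.

OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000

def clamp_adj(value):
    """Clamp a requested adjustment into the kernel's valid range."""
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, value))

def protect_from_oom(pid, adj=OOM_SCORE_ADJ_MIN):
    """Write oom_score_adj for pid (needs privileges; hypothetical helper)."""
    with open("/proc/%d/oom_score_adj" % pid, "w") as f:
        f.write(str(clamp_adj(adj)))

# Usage (privileged): protect_from_oom(os.getpid())
```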

2012-02-24 21:45:36

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, 24 Feb 2012, Josef Bacik wrote:

> Um well yeah, I'm rewriting a chunk of btrfs which was rampantly leaking memory
> so the OOM just couldn't keep up with how much I was sucking down. This is
> strictly a developer is doing something stupid and needs help pointing out what
> it is sort of moment, not a day to day OOM.
>

If you're debugging new kernel code and you realize that an excessive amount
of memory is being consumed so that nothing can even fork, you may want to
try cat /proc/slabinfo before you get into that condition the next time
around, although I already suspect that you know the cache you're leaking.
It doesn't mean we need to add hundreds of lines of code to the kernel.
Try kmemleak.
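
[Editor's note: the "cat /proc/slabinfo" workflow David suggests amounts to
ranking caches by committed memory, which is also what the proposed OOM dump
would print. A minimal user-space sketch of that ranking, run here against
made-up sample data rather than a live /proc/slabinfo:]

```python
def top_slab_caches(slabinfo_text, limit=3):
    """Parse /proc/slabinfo-style text and rank caches by total memory
    footprint (num_objs * objsize), largest first."""
    caches = []
    for line in slabinfo_text.splitlines():
        if line.startswith(("slabinfo", "#")):
            continue  # skip the version line and the column-header line
        fields = line.split()
        # slabinfo v2.1 columns: name active_objs num_objs objsize ...
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        caches.append((name, num_objs * objsize))
    caches.sort(key=lambda c: c[1], reverse=True)
    return caches[:limit]

# Sample data with invented numbers, for illustration only:
sample = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
btrfs_inode        81920  81920   1024   16    4
dentry             20480  20480    192   21    1
kmalloc-64          4096   4096     64   64    1
"""
print(top_slab_caches(sample))
```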

2012-02-24 21:52:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, 24 Feb 2012, Josef Bacik wrote:
>> Um well yeah, I'm rewriting a chunk of btrfs which was rampantly leaking memory
>> so the OOM just couldn't keep up with how much I was sucking down. This is
>> strictly a developer is doing something stupid and needs help pointing out what
>> it is sort of moment, not a day to day OOM.

On Fri, Feb 24, 2012 at 11:45 PM, David Rientjes <[email protected]> wrote:
> If you're debugging new kernel code and you realize that excessive amount
> of memory is being consumed so that nothing can even fork, you may want to
> try cat /proc/slabinfo before you get into that condition the next time
> around, although I already suspect that you know the cache you're leaking.
> It doesn't mean we need to add hundreds of lines of code to the kernel.
> Try kmemleak.

Kmemleak is a wonderful tool but it's also pretty heavy-weight which
makes it inconvenient in many cases.

Pekka

2012-02-24 23:51:53

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, 24 Feb 2012, Pekka Enberg wrote:

> On Fri, 24 Feb 2012, Josef Bacik wrote:
> >> Um well yeah, I'm rewriting a chunk of btrfs which was rampantly leaking memory
> >> so the OOM just couldn't keep up with how much I was sucking down. This is
> >> strictly a developer is doing something stupid and needs help pointing out what
> >> it is sort of moment, not a day to day OOM.
>
> On Fri, Feb 24, 2012 at 11:45 PM, David Rientjes <[email protected]> wrote:
> > If you're debugging new kernel code and you realize that excessive amount
> > of memory is being consumed so that nothing can even fork, you may want to
> > try cat /proc/slabinfo before you get into that condition the next time
> > around, although I already suspect that you know the cache you're leaking.
> > It doesn't mean we need to add hundreds of lines of code to the kernel.
> > Try kmemleak.
>
> Kmemleak is a wonderful tool but it's also pretty heavy-weight which
> makes it inconvenient in many cases.
>

Too heavyweight to enable when debugging issues after "rewriting a chunk"
of a filesystem that manipulates kernel memory? I can't imagine a better
time to enable it.

2012-02-29 03:42:06

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH] oom: add sysctl to enable slab memory dump

On Fri, Feb 24, 2012 at 12:05:27PM +0200, Pekka Enberg wrote:
> On Fri, Feb 24, 2012 at 12:03 PM, David Rientjes <[email protected]> wrote:
> > I like how slub handles this when it can't allocate more slab with
> > slab_out_of_memory() and has the added benefit of still warning even with
> > __GFP_NORETRY that the oom killer is never called for. If there's really
> > a slab leak happening, there's a good chance that this diagnostic
> > information is going to be emitted by the offending cache at some point in
> > time if you're using slub. This could easily be extended to slab.c, so
> > it's even more reason not to include this type of information in the oom
> > killer.
>
> Works for me. Rafael?

New patch, following the suggested approach, posted:
https://lkml.org/lkml/2012/2/28/561

Thanks folks, for all your feedback here!
Rafael