2008-01-16 22:31:32

by Andi Kleen

Subject: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations


I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
computer-completely-unresponsive situation on my desktop when some user
program decided to grab all memory. It eventually recovered, but left
lots of ugly and, imho, misleading messages in the kernel log. Here's a
minor improvement.

-Andi

---

Only print kernel debug information for OOMs caused by kernel allocations

For any page cache allocation don't print the backtrace and the detailed
zone debugging information. This makes the problem look less like
a kernel bug because it typically isn't.

I needed a new task flag for that. Since the bits are running low
I reused an unused one (PF_STARTING)

Also clarify the error message (OOM means nothing to a normal user)

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/pagemap.h | 7 -------
include/linux/sched.h | 2 +-
mm/filemap.c | 16 +++++++++++-----
mm/oom_kill.c | 9 ++++++---
4 files changed, 18 insertions(+), 16 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1358,7 +1358,7 @@ static inline void put_task_struct(struc
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
-#define PF_STARTING 0x00000002 /* being created */
+#define PF_USER_ALLOC 0x00000002 /* Current allocation is user initiated */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */
#define PF_VCPU 0x00000010 /* I'm a virtual CPU */
Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c
+++ linux/mm/oom_kill.c
@@ -340,11 +340,14 @@ static int oom_kill_process(struct task_
struct task_struct *c;

if (printk_ratelimit()) {
- printk(KERN_WARNING "%s invoked oom-killer: "
+ printk(KERN_WARNING "%s ran out of available memory: "
"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
current->comm, gfp_mask, order, current->oomkilladj);
- dump_stack();
- show_mem();
+ /* Only dump backtrace for kernel initiated allocations */
+ if (!(p->flags & PF_USER_ALLOC)) {
+ dump_stack();
+ show_mem();
+ }
}

/*
Index: linux/include/linux/pagemap.h
===================================================================
--- linux.orig/include/linux/pagemap.h
+++ linux/include/linux/pagemap.h
@@ -62,14 +62,7 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

-#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
-#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
-{
- return alloc_pages(gfp, 0);
-}
-#endif

static inline struct page *page_cache_alloc(struct address_space *x)
{
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -482,17 +482,23 @@ int add_to_page_cache_lru(struct page *p
return ret;
}

-#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
{
+ struct task_struct *me = current;
+ unsigned old = (~me->flags) & PF_USER_ALLOC;
+ struct page *p;
+
+ me->flags |= PF_USER_ALLOC;
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
- }
- return alloc_pages(gfp, 0);
+ p = alloc_pages_node(n, gfp, 0);
+ } else
+ p = alloc_pages(gfp, 0);
+ /* Clear USER_ALLOC if it wasn't set originally */
+ me->flags ^= old;
+ return p;
}
EXPORT_SYMBOL(__page_cache_alloc);
-#endif

static int __sleep_on_page_lock(void *word)
{
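
The flag handling above can look odd at first sight. Here is a minimal,
self-contained userspace sketch of the same save-and-restore idiom (the
helper name and test harness are invented for illustration; only the
PF_USER_ALLOC value is taken from the patch):

#include <assert.h>
#include <stdio.h>

#define PF_USER_ALLOC 0x00000002	/* bit value as defined in the patch */

/*
 * Set the flag for the duration of a call, then restore its previous
 * state without a branch: 'old' holds the bit only if it was NOT set on
 * entry, so the final XOR clears it only in that case.
 */
static unsigned with_user_alloc(unsigned flags)
{
	unsigned old = (~flags) & PF_USER_ALLOC;

	flags |= PF_USER_ALLOC;		/* flag is set while "allocating" */
	/* ... the page allocation would happen here ... */
	flags ^= old;			/* undo only what this call changed */
	return flags;
}

int main(void)
{
	/* Bit clear on entry: it is clear again afterwards. */
	assert(with_user_alloc(0) == 0);
	/* Bit already set (e.g. by a nested caller): it stays set. */
	assert(with_user_alloc(PF_USER_ALLOC) == PF_USER_ALLOC);
	puts("flag save/restore behaves as in the patch");
	return 0;
}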


2008-01-16 22:55:35

by Paul Jackson

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

Andi wrote:
> here's a minor improvement

Nice - thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-28 05:52:48

by Andrew Morton

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

On Wed, 16 Jan 2008 23:24:21 +0100 Andi Kleen <[email protected]> wrote:

>
> I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
> computer-completely-unresponsive situation on my desktop when some user
> program decided to grab all memory. It eventually recovered, but left
> lots of ugly and, imho, misleading messages in the kernel log. Here's a
> minor improvement.
>
> -Andi
>
> ---
>
> Only print kernel debug information for OOMs caused by kernel allocations
>
> For any page cache allocation don't print the backtrace and the detailed
> zone debugging information. This makes the problem look less like
> a kernel bug because it typically isn't.
>
> I needed a new task flag for that. Since the bits are running low
> I reused an unused one (PF_STARTING)
>
> Also clarify the error message (OOM means nothing to a normal user)
>

That information is useful for working out why a userspace allocation
attempt failed. If we don't print it, and the application gets killed and
thus frees a lot of memory, we will just never know why the allocation
failed.

> struct page *__page_cache_alloc(gfp_t gfp)
> {
> + struct task_struct *me = current;
> + unsigned old = (~me->flags) & PF_USER_ALLOC;
> + struct page *p;
> +
> + me->flags |= PF_USER_ALLOC;
> if (cpuset_do_page_mem_spread()) {
> int n = cpuset_mem_spread_node();
> - return alloc_pages_node(n, gfp, 0);
> - }
> - return alloc_pages(gfp, 0);
> + p = alloc_pages_node(n, gfp, 0);
> + } else
> + p = alloc_pages(gfp, 0);
> + /* Clear USER_ALLOC if it wasn't set originally */
> + me->flags ^= old;
> + return p;
> }

That's an appreciable amount of new overhead for, at best, a fairly marginal
benefit. Perhaps __GFP_USER could be [re|ab]used.

Alternatively: if we've printed the diagnostic on behalf of this process
and then decided to kill it, set some flag to prevent us from printing it
again.
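
For illustration only, the gfp-based variant would look roughly like this
against the oom_kill_process() hunk above. __GFP_USER is hypothetical here --
there is no such dedicated bit, so in practice an existing flag would have to
be reused or a new one defined:

	if (printk_ratelimit()) {
		printk(KERN_WARNING "%s ran out of available memory: "
			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
			current->comm, gfp_mask, order, current->oomkilladj);
		/* Key off the allocation's gfp_mask rather than a task flag;
		 * __GFP_USER stands for "allocated on behalf of userspace". */
		if (!(gfp_mask & __GFP_USER)) {
			dump_stack();
			show_mem();
		}
	}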

2008-01-28 06:11:46

by Andi Kleen

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

On Monday 28 January 2008 06:52, Andrew Morton wrote:
> On Wed, 16 Jan 2008 23:24:21 +0100 Andi Kleen <[email protected]> wrote:
> > I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
> > computer-completely-unresponsive situation on my desktop when some user
> > program decided to grab all memory. It eventually recovered, but left
> > lots of ugly and, imho, misleading messages in the kernel log. Here's a
> > minor improvement.

As a followup, this was with swap over dm-crypt. I've recently heard
about other people having trouble with this too, so this setup seems to
trigger something bad in the VM.

> That information is useful for working out why a userspace allocation
> attempt failed. If we don't print it, and the application gets killed and
> thus frees a lot of memory, we will just never know why the allocation
> failed.

But it's basically only either page faults (direct or indirect) or write()
et al. that do these page cache allocations. Do you really think it is that
important to distinguish these cases individually? In 95+% of all cases it
should be a standard user page fault, which always has the same backtrace.

To figure out why the application really OOMed in those cases you would
need a user-level backtrace, but the message doesn't supply that anyway.

All other cases will still print the full backtrace, so if some kernel
subsystem runs amok it should still be possible to diagnose it.

Please reconsider.

>
> > struct page *__page_cache_alloc(gfp_t gfp)
> > {
> > + struct task_struct *me = current;
> > + unsigned old = (~me->flags) & PF_USER_ALLOC;
> > + struct page *p;
> > +
> > + me->flags |= PF_USER_ALLOC;
> > if (cpuset_do_page_mem_spread()) {
> > int n = cpuset_mem_spread_node();
> > - return alloc_pages_node(n, gfp, 0);
> > - }
> > - return alloc_pages(gfp, 0);
> > + p = alloc_pages_node(n, gfp, 0);
> > + } else
> > + p = alloc_pages(gfp, 0);
> > + /* Clear USER_ALLOC if it wasn't set originally */
> > + me->flags ^= old;
> > + return p;
> > }
>
> That's an appreciable amount of new overhead for, at best, a fairly
> marginal benefit. Perhaps __GFP_USER could be [re|ab]used.

It's a few non-atomic bit operations. Do you really think that is considerable
overhead? Also, it should all be cache-hot already. My guess is that even with
the additional function call it's < 10 cycles more.

> Alternatively: if we've printed the diagnostic on behalf of this process
> and then decided to kill it, set some flag to prevent us from printing it
> again.

Do you really think that would help? I thought these messages usually came
from different processes.

-Andi

2008-01-28 08:56:56

by Andrew Morton

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

On Mon, 28 Jan 2008 07:10:07 +0100 Andi Kleen <[email protected]> wrote:

> On Monday 28 January 2008 06:52, Andrew Morton wrote:
> > On Wed, 16 Jan 2008 23:24:21 +0100 Andi Kleen <[email protected]> wrote:
> > > I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
> > > computer-completely-unresponsive situation on my desktop when some user
> > > program decided to grab all memory. It eventually recovered, but left
> > > lots of ugly and, imho, misleading messages in the kernel log. Here's a
> > > minor improvement.
>
> As a followup, this was with swap over dm-crypt. I've recently heard
> about other people having trouble with this too, so this setup seems to
> trigger something bad in the VM.

Where's the backtrace and show_mem() output? :)

> > That information is useful for working out why a userspace allocation
> > attempt failed. If we don't print it, and the application gets killed and
> > thus frees a lot of memory, we will just never know why the allocation
> > failed.
>
> But it's basically only either page faults (direct or indirect) or write()
> et al. that do these page cache allocations. Do you really think it is that
> important to distinguish these cases individually? In 95+% of all cases it
> should be a standard user page fault, which always has the same backtrace.

Sure, the backtrace isn't very important. The show_mem() output is vital.

> To figure out why the application really OOMed in those cases you would
> need a user-level backtrace, but the message doesn't supply that anyway.
>
> All other cases will still print the full backtrace, so if some kernel
> subsystem runs amok it should still be possible to diagnose it.
>

We need the show_mem() output to see where all the memory went, and to see
what state page reclaim is in.

>
> >
> > > struct page *__page_cache_alloc(gfp_t gfp)
> > > {
> > > + struct task_struct *me = current;
> > > + unsigned old = (~me->flags) & PF_USER_ALLOC;
> > > + struct page *p;
> > > +
> > > + me->flags |= PF_USER_ALLOC;
> > > if (cpuset_do_page_mem_spread()) {
> > > int n = cpuset_mem_spread_node();
> > > - return alloc_pages_node(n, gfp, 0);
> > > - }
> > > - return alloc_pages(gfp, 0);
> > > + p = alloc_pages_node(n, gfp, 0);
> > > + } else
> > > + p = alloc_pages(gfp, 0);
> > > + /* Clear USER_ALLOC if it wasn't set originally */
> > > + me->flags ^= old;
> > > + return p;
> > > }
> >
> > That's an appreciable amount of new overhead for, at best, a fairly
> > marginal benefit. Perhaps __GFP_USER could be [re|ab]used.
>
> It's a few non-atomic bit operations. Do you really think that is considerable
> overhead? Also, it should all be cache-hot already. My guess is that even with
> the additional function call it's < 10 cycles more.

Plus an additional function call. On the already-deep page allocation
path, I might add.

> > Alternatively: if we've printed the diagnostic on behalf of this process
> > and then decided to kill it, set some flag to prevent us from printing it
> > again.
>
> Do you really think that would help? I thought these messages usually came
> from different processes.

Dunno.

2008-01-28 09:13:37

by Andi Kleen

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

On Monday 28 January 2008 09:56, Andrew Morton wrote:
> On Mon, 28 Jan 2008 07:10:07 +0100 Andi Kleen <[email protected]> wrote:
> > On Monday 28 January 2008 06:52, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 23:24:21 +0100 Andi Kleen <[email protected]> wrote:
> > > > I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
> > > > computer-completely-unresponsive situation on my desktop when some
> > > > user program decided to grab all memory. It eventually recovered, but
> > > > left lots of ugly and, imho, misleading messages in the kernel log.
> > > > Here's a minor improvement.
> >
> > As a followup, this was with swap over dm-crypt. I've recently heard
> > about other people having trouble with this too, so this setup seems to
> > trigger something bad in the VM.
>
> Where's the backtrace and show_mem() output? :)

I don't have it anymore. You want me to reproduce it? I don't think
I saw messages from the other people either; just heard complaints.

> > > That information is useful for working out why a userspace allocation
> > > attempt failed. If we don't print it, and the application gets killed
> > > and thus frees a lot of memory, we will just never know why the
> > > allocation failed.
> >
> > But it's basically only either page faults (direct or indirect) or write()
> > et al. that do these page cache allocations. Do you really think it is
> > that important to distinguish these cases individually? In 95+% of all
> > cases it should be a standard user page fault, which always has the same
> > backtrace.
>
> Sure, the backtrace isn't very important. The show_mem() output is vital.

I see. So would the patch be acceptable if it only disabled the backtrace?
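
I.e., keep show_mem() unconditional and gate only dump_stack(); as a rough
sketch of what that hunk would then become:

	if (printk_ratelimit()) {
		printk(KERN_WARNING "%s ran out of available memory: "
			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
			current->comm, gfp_mask, order, current->oomkilladj);
		/* Backtrace only for kernel-initiated allocations ... */
		if (!(p->flags & PF_USER_ALLOC))
			dump_stack();
		/* ... but always print the zone/reclaim state */
		show_mem();
	}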

> Plus an additional function call. On the already-deep page allocation
> path, I might add.

The function call is already there if the kernel has CPUSETs enabled.
And that is what distribution kernels usually do. And most users
use distribution kernels or distribution .config.

-Andi

2008-01-28 09:27:21

by Andrew Morton

Subject: Re: [PATCH] Only print kernel debug information for OOMs caused by kernel allocations

On Mon, 28 Jan 2008 10:11:57 +0100 Andi Kleen <[email protected]> wrote:

> On Monday 28 January 2008 09:56, Andrew Morton wrote:
> > On Mon, 28 Jan 2008 07:10:07 +0100 Andi Kleen <[email protected]> wrote:
> > > On Monday 28 January 2008 06:52, Andrew Morton wrote:
> > > > On Wed, 16 Jan 2008 23:24:21 +0100 Andi Kleen <[email protected]> wrote:
> > > > > I recently suffered a 20+ minute OOM, thrash-the-disk-to-death,
> > > > > computer-completely-unresponsive situation on my desktop when some
> > > > > user program decided to grab all memory. It eventually recovered, but
> > > > > left lots of ugly and, imho, misleading messages in the kernel log.
> > > > > Here's a minor improvement.
> > >
> > > As a followup, this was with swap over dm-crypt. I've recently heard
> > > about other people having trouble with this too, so this setup seems to
> > > trigger something bad in the VM.
> >
> > Where's the backtrace and show_mem() output? :)
>
> I don't have it anymore. You want me to reproduce it? I don't think
> I saw messages from the other people either; just heard complaints.

May as well - it doesn't sound like it'll fix itself...

> > > > That information is useful for working out why a userspace allocation
> > > > attempt failed. If we don't print it, and the application gets killed
> > > > and thus frees a lot of memory, we will just never know why the
> > > > allocation failed.
> > >
> > > But it's basically only either page faults (direct or indirect) or write()
> > > et al. that do these page cache allocations. Do you really think it is
> > > that important to distinguish these cases individually? In 95+% of all
> > > cases it should be a standard user page fault, which always has the same
> > > backtrace.
> >
> > Sure, the backtrace isn't very important. The show_mem() output is vital.
>
> I see. So would the patch be acceptable if it only disabled the backtrace?

Spose so. The show_mem() spew is probably larger than the backtrace
though.

Are you sure we aren't doing dump_stack()/show_mem() multiple times for a
single process? If we are, that would mean the TIF_MEMDIE thing broke.

It must have been one heck of an oomkilling slaughter.

> > Plus an additional function call. On the already-deep page allocation
> > path, I might add.
>
> The function call is already there if the kernel has CPUSETs enabled.

s/CPUSETS/NUMA/, which makes rather a difference.

> And that is what distribution kernels usually do. And most users
> use distribution kernels or distribution .config.