2006-05-18 17:29:59

by Dave Peterson

Subject: [PATCH] mm: avoid unnecessary OOM kills

Below is a 2.6.17-rc4-mm1 patch that fixes a problem where the OOM killer was
unnecessarily killing system daemons in addition to memory-hogging user
processes. The patch ensures that the following assertion is
satisfied:

If a failed attempt to allocate memory triggers the OOM killer, then the
failed attempt must have occurred _after_ any process previously shot by
the OOM killer has cleaned out its mm_struct.

Thus we avoid situations where concurrent invocations of the OOM killer cause
more processes to be shot than necessary to resolve the OOM condition.

Signed-off-by: David S. Peterson <[email protected]>
---

This is a repost of a patch discussed a few weeks ago. I got sidetracked for
a while, and am just now getting back to this. The patch below incorporates
feedback from previous discussion and fixes a minor race condition in my
previous code.
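
For readers skimming the diff: in outline, the mechanism works as in the
condensed sketch below (a summary only; the real changes are in the hunks
that follow, and oom_kill_start()/oom_kill_finish() are defined in the
swap.h hunk).

	/* __alloc_pages() slow path (page_alloc.c hunk): only one task at a
	 * time may run the OOM killer. */
	if (oom_kill_start())
		return NULL;		/* a kill is already in flight */

	/* One last try at the high watermark. */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
	if (page) {
		oom_kill_finish();	/* no kill needed after all */
		return page;
	}

	/* Still nothing: shoot a victim.  out_of_memory() tags the victim's
	 * mm with MM_FLAG_OOM_NOTIFY (oom_kill.c hunk); once the victim's
	 * mmput() has torn that mm down, it calls oom_kill_finish()
	 * (fork.c hunk).  Only then can a later allocation failure start
	 * another OOM kill. */
	out_of_memory(zonelist, gfp_mask, order);
	return NULL;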


Index: linux-2.6.17-rc4-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/sched.h 2006-05-17 22:31:38.000000000 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/sched.h 2006-05-17 22:33:54.000000000 -0700
@@ -296,6 +296,9 @@
(mm)->hiwater_vm = (mm)->total_vm; \
} while (0)

+/* bit #s for flags in mm_struct->flags... */
+#define MM_FLAG_OOM_NOTIFY 0
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -354,6 +357,8 @@
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ unsigned long flags;
};

struct sighand_struct {
Index: linux-2.6.17-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/swap.h 2006-05-17 22:31:38.000000000 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/swap.h 2006-05-17 22:33:54.000000000 -0700
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/list.h>
#include <linux/sched.h>
+#include <linux/bitops.h>

#include <asm/atomic.h>
#include <asm/page.h>
@@ -155,6 +156,29 @@
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* linux/mm/oom_kill.c */
+extern volatile unsigned long oom_kill_in_progress;
+
+/*
+ * Attempt to start an OOM kill operation. Return 0 on success, or 1 if an
+ * OOM kill is already in progress.
+ */
+static inline int oom_kill_start(void)
+{
+ return test_and_set_bit(0, &oom_kill_in_progress);
+}
+
+/*
+ * Terminate an OOM kill operation.
+ *
+ * When the OOM killer chooses a victim, it sets a flag in the victim's
+ * mm_struct. mmput() then calls this function when the mm_users count has
+ * reached 0 and the contents of the mm_struct have been cleaned out.
+ */
+static inline void oom_kill_finish(void)
+{
+ clear_bit(0, &oom_kill_in_progress);
+}
+
extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);

/* linux/mm/memory.c */
Index: linux-2.6.17-rc4-mm1/kernel/fork.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/kernel/fork.c 2006-05-17 22:31:38.000000000 -0700
+++ linux-2.6.17-rc4-mm1/kernel/fork.c 2006-05-17 22:33:54.000000000 -0700
@@ -328,6 +328,7 @@
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+ mm->flags = 0;

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
@@ -381,6 +382,8 @@
spin_unlock(&mmlist_lock);
}
put_swap_token(mm);
+ if (unlikely(test_bit(MM_FLAG_OOM_NOTIFY, &mm->flags)))
+ oom_kill_finish(); /* terminate pending OOM kill */
mmdrop(mm);
}
}
Index: linux-2.6.17-rc4-mm1/mm/oom_kill.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/oom_kill.c 2006-05-17 22:31:38.000000000 -0700
+++ linux-2.6.17-rc4-mm1/mm/oom_kill.c 2006-05-17 22:33:54.000000000 -0700
@@ -25,6 +25,8 @@
int sysctl_panic_on_oom;
/* #define DEBUG */

+volatile unsigned long oom_kill_in_progress = 0;
+
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
@@ -260,27 +262,31 @@
struct mm_struct *mm;
task_t * g, * q;

+ task_lock(p);
mm = p->mm;

- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
+ if (mm == NULL || mm == &init_mm) {
+ task_unlock(p);
+ return 1;
+ }
+
+ set_bit(MM_FLAG_OOM_NOTIFY, &mm->flags);
+ task_unlock(p);
+
+ /* WARNING: mm may no longer be dereferenced since we did not obtain
+ * its value from get_task_mm(p). This is OK since all we need to do
+ * is compare mm to q->mm below.
*
* Furthermore, even if mm contains a non-NULL value, p->mm may
- * change to NULL at any time since we do not hold task_lock(p).
+ * change to NULL at any time since we no longer hold task_lock(p).
* However, this is of no concern to us.
*/

- if (mm == NULL || mm == &init_mm)
- return 1;
-
- __oom_kill_task(p, message);
/*
- * kill all processes that share the ->mm (i.e. all threads),
- * but are in a different thread group
+ * kill all processes that share the ->mm (i.e. all threads)
*/
do_each_thread(g, q)
- if (q->mm == mm && q->tgid != p->tgid)
+ if (q->mm == mm)
__oom_kill_task(q, message);
while_each_thread(g, q);

@@ -313,19 +319,20 @@
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
+ *
+ * Note: Caller must have obtained a value of 0 from oom_kill_start().
+ * However, by calling this function the caller absolves itself of
+ * responsibility for calling oom_kill_finish().
*/
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
{
task_t *p;
unsigned long points = 0;
+ int cancel = 0;

- if (printk_ratelimit()) {
- printk("oom-killer: gfp_mask=0x%x, order=%d\n",
- gfp_mask, order);
- dump_stack();
- show_mem();
- }
-
+ printk("oom-killer: gfp_mask=0x%x, order=%d\n", gfp_mask, order);
+ dump_stack();
+ show_mem();
cpuset_lock();
read_lock(&tasklist_lock);

@@ -335,12 +342,12 @@
*/
switch (constrained_alloc(zonelist, gfp_mask)) {
case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, points,
+ cancel = oom_kill_process(current, points,
"No available memory (MPOL_BIND)");
break;

case CONSTRAINT_CPUSET:
- oom_kill_process(current, points,
+ cancel = oom_kill_process(current, points,
"No available memory in cpuset");
break;

@@ -354,8 +361,10 @@
*/
p = select_bad_process(&points);

- if (PTR_ERR(p) == -1UL)
+ if (PTR_ERR(p) == -1UL) {
+ cancel = 1;
goto out;
+ }

/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
@@ -374,6 +383,9 @@
read_unlock(&tasklist_lock);
cpuset_unlock();

+ if (cancel)
+ oom_kill_finish(); /* cancel OOM kill operation */
+
/*
* Give "p" a good chance of killing itself before we
* retry to allocate memory unless "p" is current
Index: linux-2.6.17-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/page_alloc.c 2006-05-17 22:31:38.000000000 -0700
+++ linux-2.6.17-rc4-mm1/mm/page_alloc.c 2006-05-17 22:33:54.000000000 -0700
@@ -910,6 +910,36 @@
return 1;
}

+/* Try to allocate one more time before invoking the OOM killer. */
+static struct page * oom_alloc(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist)
+{
+ struct page *page;
+
+ /* The use of oom_kill_start() below prevents parallel OOM kill
+ * operations. This fixes a problem where the OOM killer was observed
+ * shooting system daemons in addition to memory-hogging user
+ * processes.
+ */
+ if (oom_kill_start())
+ return NULL; /* previous OOM kill still in progress */
+
+ /* If we get this far, we _know_ that any previous OOM killer victim
+ * has cleaned out its mm_struct. Therefore we should pick a victim
+ * to shoot if the allocation below fails.
+ */
+ page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
+ zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
+
+ if (page) {
+ oom_kill_finish(); /* cancel OOM kill operation */
+ return page;
+ }
+
+ out_of_memory(zonelist, gfp_mask, order);
+ return NULL;
+}
+
/*
* get_page_from_freeliest goes through the zonelist trying to allocate
* a page.
@@ -1116,18 +1146,8 @@
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
- /*
- * Go through the zonelist yet one more time, keep
- * very high watermark here, this is only to catch
- * a parallel oom killing, we must fail if we're still
- * under heavy pressure.
- */
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
- if (page)
+ if ((page = oom_alloc(gfp_mask, order, zonelist)) != NULL)
goto got_pg;
-
- out_of_memory(zonelist, gfp_mask, order);
goto restart;
}


2006-05-18 20:05:29

by Andrew Morton

Subject: Re: [PATCH] mm: avoid unnecessary OOM kills

Dave Peterson <[email protected]> wrote:
>
> Below is a 2.6.17-rc4-mm1 patch that fixes a problem where the OOM killer was
> unnecessarily killing system daemons in addition to memory-hogging user
> processes. The patch fixes things so that the following assertion is
> satisfied:
>
> If a failed attempt to allocate memory triggers the OOM killer, then the
> failed attempt must have occurred _after_ any process previously shot by
> the OOM killer has cleaned out its mm_struct.
>
> Thus we avoid situations where concurrent invocations of the OOM killer cause
> more processes to be shot than necessary to resolve the OOM condition.
>
> ...
>
> --- linux-2.6.17-rc4-mm1.orig/include/linux/swap.h 2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/include/linux/swap.h 2006-05-17 22:33:54.000000000 -0700
> @@ -155,6 +156,29 @@
> #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
>
> /* linux/mm/oom_kill.c */
> +extern volatile unsigned long oom_kill_in_progress;

This shouldn't be volatile.

> +/*
> + * Attempt to start an OOM kill operation. Return 0 on success, or 1 if an
> + * OOM kill is already in progress.
> + */
> +static inline int oom_kill_start(void)
> +{
> + return test_and_set_bit(0, &oom_kill_in_progress);
> +}

Suggest this be called oom_kill_trystart().

> ===================================================================
> --- linux-2.6.17-rc4-mm1.orig/mm/oom_kill.c 2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/mm/oom_kill.c 2006-05-17 22:33:54.000000000 -0700
> @@ -25,6 +25,8 @@
> int sysctl_panic_on_oom;
> /* #define DEBUG */
>
> +volatile unsigned long oom_kill_in_progress = 0;

This shouldn't be initialised to zero. The kernel zeroes bss at startup.

> /**
> * badness - calculate a numeric value for how bad this task has been
> * @p: task struct of which task we should calculate
> @@ -260,27 +262,31 @@
> struct mm_struct *mm;
> task_t * g, * q;
>
> + task_lock(p);
> mm = p->mm;
>
> - /* WARNING: mm may not be dereferenced since we did not obtain its
> - * value from get_task_mm(p). This is OK since all we need to do is
> - * compare mm to q->mm below.
> + if (mm == NULL || mm == &init_mm) {
> + task_unlock(p);
> + return 1;
> + }
> +
> + set_bit(MM_FLAG_OOM_NOTIFY, &mm->flags);
> + task_unlock(p);

Putting task_lock() in here would be a fairly obvious way to address the
fragility which that comment describes. But I have a feeling that we
discussed that a couple of weeks ago and found a problem with it, didn't we?

> ===================================================================
> --- linux-2.6.17-rc4-mm1.orig/mm/page_alloc.c 2006-05-17 22:31:38.000000000 -0700
> +++ linux-2.6.17-rc4-mm1/mm/page_alloc.c 2006-05-17 22:33:54.000000000 -0700
> @@ -910,6 +910,36 @@
> return 1;
> }
>
> +/* Try to allocate one more time before invoking the OOM killer. */
> +static struct page * oom_alloc(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist)
> +{
> + struct page *page;
> +
> + /* The use of oom_kill_start() below prevents parallel OOM kill
> + * operations. This fixes a problem where the OOM killer was observed
> + * shooting system daemons in addition to memory-hogging user
> + * processes.
> + */
> + if (oom_kill_start())
> + return NULL; /* previous OOM kill still in progress */
> +
> + /* If we get this far, we _know_ that any previous OOM killer victim
> + * has cleaned out its mm_struct. Therefore we should pick a victim
> + * to shoot if the allocation below fails.
> + */
> + page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> + zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
> +
> + if (page) {
> + oom_kill_finish(); /* cancel OOM kill operation */
> + return page;
> + }
> +
> + out_of_memory(zonelist, gfp_mask, order);
> + return NULL;
> +}
> +
> /*
> * get_page_from_freeliest goes through the zonelist trying to allocate
> * a page.
> @@ -1116,18 +1146,8 @@
> if (page)
> goto got_pg;
> } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> - /*
> - * Go through the zonelist yet one more time, keep
> - * very high watermark here, this is only to catch
> - * a parallel oom killing, we must fail if we're still
> - * under heavy pressure.
> - */
> - page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> - zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> - if (page)
> + if ((page = oom_alloc(gfp_mask, order, zonelist)) != NULL)
> goto got_pg;
> -
> - out_of_memory(zonelist, gfp_mask, order);
> goto restart;
> }

So if process A did a successful oom_kill_start(), and processes B, C, D
and E all get into difficulty as well, they'll go into busywait loops. The
cond_resched() will eventually save us from locking up, but it's a bit
unpleasant.

And I'm trying to work out where process A will run oom_kill_finish() if
`cancel' ends up being 0 in out_of_memory(). afaict, it doesn't, and the
oom-killer will be accidentally disabled?

2006-05-19 02:39:32

by Dave Peterson

Subject: Re: [PATCH] mm: avoid unnecessary OOM kills

At 01:05 PM 5/18/2006, Andrew Morton wrote:
>> /* linux/mm/oom_kill.c */
>> +extern volatile unsigned long oom_kill_in_progress;
>
>This shouldn't be volatile.

Yes, I'll fix that.

>> +/*
>> + * Attempt to start an OOM kill operation. Return 0 on success, or 1 if an
>> + * OOM kill is already in progress.
>> + */
>> +static inline int oom_kill_start(void)
>> +{
>> + return test_and_set_bit(0, &oom_kill_in_progress);
>> +}
>
>Suggest this be called oom_kill_trystart().

Sounds good. I'll change the name.

>> +volatile unsigned long oom_kill_in_progress = 0;
>
>This shouldn't be initialised to zero. The kernel zeroes bss at startup.

Ok, I'll fix this.

>> /**
>> * badness - calculate a numeric value for how bad this task has been
>> * @p: task struct of which task we should calculate
>> @@ -260,27 +262,31 @@
>> struct mm_struct *mm;
>> task_t * g, * q;
>>
>> + task_lock(p);
>> mm = p->mm;
>>
>> - /* WARNING: mm may not be dereferenced since we did not obtain its
>> - * value from get_task_mm(p). This is OK since all we need to do is
>> - * compare mm to q->mm below.
>> + if (mm == NULL || mm == &init_mm) {
>> + task_unlock(p);
>> + return 1;
>> + }
>> +
>> + set_bit(MM_FLAG_OOM_NOTIFY, &mm->flags);
>> + task_unlock(p);
>
>Putting task_lock() in here would be a fairly obvious way to address the
>fragility which that comment describes. But I have a feeling that we
>discussed that a couple of weeks ago and found a problem with it, didn't we?

I think task_lock() works fine. That's what I have above.
When I generated the patch, I apparently omitted the "-p" flag to
diff, so the hunk headers make the code changes above appear to be in
badness() when they are actually in oom_kill_task(). My apologies (I
know this makes things hard to read). I'll fix this when I repost.

>So if process A did a successful oom_kill_start(), and processes B, C, D
>and E all get into difficulty as well, they'll go into busywait loops. The
>cond_resched() will eventually save us from locking up, but it's a bit
>unpleasant.

Hmm... that's a good point. I kind of assumed the other processes
would go to sleep somewhere inside one of the calls to
get_page_from_freelist(), but it looks like this is not the case.
Without my changes, all of the processes would call out_of_memory()
and then loop back around to "restart". However, there's a call to
schedule_timeout_interruptible() in out_of_memory() that processes B,
C, D, and E will circumvent if my changes are applied. I'll try to
work out a solution to this (a rough sketch of one possibility is
below) and then repost.
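
One possible shape for that fix, as a rough and untested sketch only
(using the oom_kill_trystart() name suggested above), would be to make
the losers of the trystart race sleep briefly before returning NULL and
looping back to "restart", much like the schedule_timeout call that
out_of_memory() already does for the killing task:

	static struct page *oom_alloc(gfp_t gfp_mask, unsigned int order,
				      struct zonelist *zonelist)
	{
		struct page *page;

		if (oom_kill_trystart()) {
			/* A kill is already in flight; give the victim time
			 * to exit instead of busy-looping back through the
			 * allocator. */
			schedule_timeout_interruptible(1);
			return NULL;
		}

		page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
					zonelist, ALLOC_WMARK_HIGH | ALLOC_CPUSET);
		if (page) {
			oom_kill_finish();	/* cancel OOM kill operation */
			return page;
		}

		out_of_memory(zonelist, gfp_mask, order);
		return NULL;
	}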

On somewhat of a tangent, it seems less than ideal for the allocator
to loop when memory is scarce (even if we may sleep on each
iteration). An alternative might be something like this: When a task
calls __alloc_pages(), it goes through the zonelists looking for pages
to allocate. If this fails, the task adds itself to a queue and goes
to sleep. Then when kswapd() or someone else frees some pages, the
pages are given to queued tasks, which are then awakened with their
requests satisfied. The memory allocator might optionally choose to
awaken a task without satisfying its request. Then __alloc_pages()
would return NULL for that task.
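
To make that concrete, here is a very rough sketch of the sort of thing
I mean (hypothetical code, not part of the patch, ignoring zones,
watermarks, GFP flags and irq-safe locking):

	struct alloc_waiter {
		struct list_head	list;
		unsigned int		order;
		struct page		*page;	/* set by the freer; may stay NULL */
		struct completion	done;
	};

	static LIST_HEAD(alloc_waiters);
	static DEFINE_SPINLOCK(alloc_waiters_lock);

	/* Allocator slow path: queue ourselves and sleep until woken. */
	static struct page *wait_for_pages(unsigned int order)
	{
		struct alloc_waiter w = { .order = order, .page = NULL };

		init_completion(&w.done);
		spin_lock(&alloc_waiters_lock);
		list_add_tail(&w.list, &alloc_waiters);
		spin_unlock(&alloc_waiters_lock);

		wait_for_completion(&w.done);
		return w.page;		/* NULL means "request not satisfied" */
	}

	/* Called from kswapd or the free path: hand pages to the oldest
	 * waiter and wake it (possibly with page == NULL to make it fail). */
	static void hand_pages_to_waiter(struct page *page)
	{
		struct alloc_waiter *w = NULL;

		spin_lock(&alloc_waiters_lock);
		if (!list_empty(&alloc_waiters)) {
			w = list_entry(alloc_waiters.next,
				       struct alloc_waiter, list);
			list_del(&w->list);
		}
		spin_unlock(&alloc_waiters_lock);

		if (w) {
			w->page = page;
			complete(&w->done);
		}
	}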

I haven't thought through the details of implementing something like
this in Linux. Perhaps there are issues I'm not aware of that would
make the above approach impractical. However I'd be curious to hear
your thoughts.

>And I'm trying to work out where process A will run oom_kill_finish() if
>`cancel' ends up being 0 in out_of_memory(). afaict, it doesn't, and the
>oom-killer will be accidentally disabled?

In this case, mmput() will call oom_kill_finish() once the mm_users
count on the mm_struct of the OOM killer victim has reached zero and
the mm_struct has been cleaned out. Thus additional OOM kills are
prevented until the victim has freed its memory. Again, this is a bit
unclear because I messed up generating the patch. Sorry about that.
Will fix shortly...
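
For reference until the repost: the fork.c hunk above lands in mmput().
Condensed, the path looks like this:

	/* kernel/fork.c with the patch applied (condensed) */
	void mmput(struct mm_struct *mm)
	{
		if (atomic_dec_and_test(&mm->mm_users)) {
			exit_aio(mm);
			exit_mmap(mm);	/* victim's address space torn down here */
			/* ... mmlist removal, put_swap_token(mm) ... */
			if (unlikely(test_bit(MM_FLAG_OOM_NOTIFY, &mm->flags)))
				oom_kill_finish();	/* re-arm the OOM killer */
			mmdrop(mm);
		}
	}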

I should probably also do a better job of adding comments along with
my code changes.