LinuxLists.cc - [rfc][patch 0/5] 2.6.19-rc1: oom killer fixes

2006-10-12 14:09:38

Subject: [rfc][patch 0/5] 2.6.19-rc1: oom killer fixes

I've been prompted to take another look through the OOM killer because it
turns out it is killing tasks that have had their oom_adj set to -17 (which
is supposed to make them unkillable).

So there are a number of problems, firstly, the child and sibling thread
killing routines do not account for -17 children/siblings.

Secondly, most architecture specific pagefault handlers do a direct kill
of the current process if it takes a VM_FAULT_OOM. This is a pretty rare
thing to happen, because there isn't a lot of higher order allocations
happening, but it is not impossible. I think we can just call into the
OOM killer here, and return to userspace... but I'd like comments about
this.

Thanks,
Nick
--
SuSE Labs

2006-10-12 14:09:48

by Nick Piggin

[permalink] [raw]

Subject: [patch 1/5] oom: don't kill unkillable children or siblings

Abort the kill if any of our threads have OOM_DISABLE set. Having this test
here also prevents any OOM_DISABLE child of the "selected" process from being
killed.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -312,15 +312,24 @@ static int oom_kill_task(struct task_str
if (mm == NULL)
return 1;

+ /*
+ * Don't kill the process if any threads are set to OOM_DISABLE
+ */
+ do_each_thread(g, q) {
+ if (q->mm == mm && p->oomkilladj == OOM_DISABLE)
+ return 1;
+ } while_each_thread(g, q);
+
__oom_kill_task(p, message);
+
/*
* kill all processes that share the ->mm (i.e. all threads),
* but are in a different thread group
*/
- do_each_thread(g, q)
+ do_each_thread(g, q) {
if (q->mm == mm && q->tgid != p->tgid)
__oom_kill_task(q, message);
- while_each_thread(g, q);
+ } while_each_thread(g, q);

return 0;
}

2006-10-12 14:10:01

by Nick Piggin

[permalink] [raw]

Subject: [patch 2/5] oom: cleanup messages

Clean up the OOM killer messages to be more consistent.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -263,7 +263,7 @@ static struct task_struct *select_bad_pr
* flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
* set.
*/
-static void __oom_kill_task(struct task_struct *p, const char *message)
+static void __oom_kill_task(struct task_struct *p, int verbose)
{
if (is_init(p)) {
WARN_ON(1);
@@ -277,10 +277,8 @@ static void __oom_kill_task(struct task_
return;
}

- if (message) {
- printk(KERN_ERR "%s: Killed process %d (%s).\n",
- message, p->pid, p->comm);
- }
+ if (verbose)
+ printk(KERN_ERR "Killed process %d (%s).\n", p->pid, p->comm);

/*
* We give our sacrificial lamb high priority and access to
@@ -293,7 +291,7 @@ static void __oom_kill_task(struct task_
force_sig(SIGKILL, p);
}

-static int oom_kill_task(struct task_struct *p, const char *message)
+static int oom_kill_task(struct task_struct *p)
{
struct mm_struct *mm;
struct task_struct *g, *q;
@@ -320,15 +318,15 @@ static int oom_kill_task(struct task_str
return 1;
} while_each_thread(g, q);

- __oom_kill_task(p, message);
+ __oom_kill_task(p, 1);

/*
* kill all processes that share the ->mm (i.e. all threads),
- * but are in a different thread group
+ * but are in a different thread group.
*/
do_each_thread(g, q) {
if (q->mm == mm && q->tgid != p->tgid)
- __oom_kill_task(q, message);
+ __oom_kill_task(q, 1);
} while_each_thread(g, q);

return 0;
@@ -345,21 +343,22 @@ static int oom_kill_process(struct task_
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p, NULL);
+ __oom_kill_task(p, 0);
return 0;
}

- printk(KERN_ERR "Out of Memory: Kill process %d (%s) score %li"
- " and children.\n", p->pid, p->comm, points);
+ printk(KERN_ERR "%s: kill process %d (%s) score %li or a child.\n",
+ message, p->pid, p->comm, points);
+
/* Try to kill a child first */
list_for_each(tsk, &p->children) {
c = list_entry(tsk, struct task_struct, sibling);
if (c->mm == p->mm)
continue;
- if (!oom_kill_task(c, message))
+ if (!oom_kill_task(c))
return 0;
}
- return oom_kill_task(p, message);
+ return oom_kill_task(p);
}

static BLOCKING_NOTIFIER_HEAD(oom_notify_list);

2006-10-12 14:10:48

by Nick Piggin

[permalink] [raw]

Subject: [patch 3/5] oom: less memdie

Don't cause all threads in all other thread groups to gain TIF_MEMDIE
otherwise we'll get a thundering herd eating out memory reserve. This
may not be the optimal scheme, but it fits our policy of allowing just
one TIF_MEMDIE in the system at once.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -322,11 +322,12 @@ static int oom_kill_task(struct task_str

/*
* kill all processes that share the ->mm (i.e. all threads),
- * but are in a different thread group.
+ * but are in a different thread group. Don't let them have access
+ * to memory reserves though, otherwise we might deplete all memory.
*/
do_each_thread(g, q) {
if (q->mm == mm && q->tgid != p->tgid)
- __oom_kill_task(q, 1);
+ force_sig(SIGKILL, p);
} while_each_thread(g, q);

return 0;

2006-10-12 14:11:03

by Nick Piggin

[permalink] [raw]

Subject: [patch 5/5] oom: invoke OOM killer from pagefault handler

Rather than have the pagefault handler kill a process directly if it gets a
VM_FAULT_OOM, have it call into the OOM killer.

Only converted a few architectures so far - this is just an RFC.

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -376,6 +376,57 @@ int unregister_oom_notifier(struct notif
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);

+/*
+ * Must be called with cpuset_lock and tasklist_lock held for read.
+ */
+void __out_of_memory(void)
+{
+ unsigned long points = 0;
+ struct task_struct *p;
+
+ if (sysctl_panic_on_oom)
+ panic("out of memory. panic_on_oom is selected\n");
+retry:
+ /*
+ * Rambo mode: Shoot down a process and hope it solves whatever
+ * issues we may have.
+ */
+ p = select_bad_process(&points);
+
+ if (PTR_ERR(p) == -1UL)
+ return;
+
+ /* Found nothing?!?! Either we hang forever, or we panic. */
+ if (!p) {
+ read_unlock(&tasklist_lock);
+ cpuset_unlock();
+ panic("Out of memory and no killable processes...\n");
+ }
+
+ if (oom_kill_process(p, points, "Out of memory"))
+ goto retry;
+}
+
+/*
+ * pagefault handler calls into here because it is out of memory but
+ * doesn't know exactly how or why.
+ */
+void pagefault_out_of_memory(void)
+{
+ if (printk_ratelimit()) {
+ printk(KERN_WARNING "%s invoked oom-killer from pagefault: "
+ "oomkilladj=%d\n", current->oomkilladj);
+ dump_stack();
+ show_mem();
+ }
+
+ cpuset_lock();
+ read_lock(&tasklist_lock);
+ __out_of_memory();
+ read_unlock(&tasklist_lock);
+ cpuset_unlock();
+}
+
/**
* out_of_memory - kill the "best" process when we run out of memory
*
@@ -386,8 +437,6 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifie
*/
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
{
- struct task_struct *p;
- unsigned long points = 0;
unsigned long freed = 0;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -412,42 +461,18 @@ void out_of_memory(struct zonelist *zone
*/
switch (constrained_alloc(zonelist, gfp_mask)) {
case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, points,
- "No available memory (MPOL_BIND)");
+ oom_kill_process(current, 0, "No available memory (MPOL_BIND)");
break;

case CONSTRAINT_CPUSET:
- oom_kill_process(current, points,
- "No available memory in cpuset");
+ oom_kill_process(current, 0, "No available memory in cpuset");
break;

case CONSTRAINT_NONE:
- if (sysctl_panic_on_oom)
- panic("out of memory. panic_on_oom is selected\n");
-retry:
- /*
- * Rambo mode: Shoot down a process and hope it solves whatever
- * issues we may have.
- */
- p = select_bad_process(&points);
-
- if (PTR_ERR(p) == -1UL)
- goto out;
-
- /* Found nothing?!?! Either we hang forever, or we panic. */
- if (!p) {
- read_unlock(&tasklist_lock);
- cpuset_unlock();
- panic("Out of memory and no killable processes...\n");
- }
-
- if (oom_kill_process(p, points, "Out of memory"))
- goto retry;
-
+ __out_of_memory();
break;
}

-out:
read_unlock(&tasklist_lock);
cpuset_unlock();

Index: linux-2.6/arch/alpha/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/alpha/mm/fault.c
+++ linux-2.6/arch/alpha/mm/fault.c
@@ -143,7 +143,6 @@ do_page_fault(unsigned long address, uns
goto bad_area;
}

- survive:
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
@@ -190,19 +189,13 @@ do_page_fault(unsigned long address, uns
die_if_kernel("Oops", regs, cause, (unsigned long*)regs - 16);
do_exit(SIGKILL);

- /* We ran out of memory, or some other thing happened to us that
- made us unable to handle the page fault gracefully. */
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
out_of_memory:
- if (is_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk(KERN_ALERT "VM: killing process %s(%d)\n",
- current->comm, current->pid);
- if (!user_mode(regs))
- goto no_context;
- do_exit(SIGKILL);
+ pagefault_out_of_memory();
+ return;

do_sigbus:
/* Send a sigbus, regardless of whether we were in kernel
Index: linux-2.6/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/i386/mm/fault.c
+++ linux-2.6/arch/i386/mm/fault.c
@@ -444,7 +444,6 @@ good_area:
goto bad_area;
}

- survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -583,21 +582,14 @@ no_context:
bust_spinlocks(0);
do_exit(SIGKILL);

-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
out_of_memory:
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
up_read(&mm->mmap_sem);
- if (is_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", tsk->comm);
- if (error_code & 4)
- do_exit(SIGKILL);
- goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/ia64/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/fault.c
+++ linux-2.6/arch/ia64/mm/fault.c
@@ -155,7 +155,6 @@ ia64_do_page_fault (unsigned long addres
if ((vma->vm_flags & mask) != mask)
goto bad_area;

- survive:
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
@@ -280,13 +279,10 @@ ia64_do_page_fault (unsigned long addres

out_of_memory:
up_read(&mm->mmap_sem);
- if (is_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk(KERN_CRIT "VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
+ pagefault_out_of_memory();
+ return;
}
Index: linux-2.6/arch/powerpc/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/fault.c
+++ linux-2.6/arch/powerpc/mm/fault.c
@@ -342,7 +342,6 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- survive:
switch (handle_mm_fault(mm, vma, address, is_write)) {

case VM_FAULT_MINOR:
@@ -380,21 +379,14 @@ bad_area_nosemaphore:

return SIGSEGV;

-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
out_of_memory:
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
up_read(&mm->mmap_sem);
- if (is_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- return SIGKILL;
+ pagefault_out_of_memory();
+ return 0;

do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/x86_64/mm/fault.c
+++ linux-2.6/arch/x86_64/mm/fault.c
@@ -407,7 +407,6 @@ asmlinkage void __kprobes do_page_fault(
if (unlikely(in_atomic() || !mm))
goto bad_area_nosemaphore;

- again:
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunatly, in the case of an
@@ -574,20 +573,14 @@ no_context:
oops_end(flags);
do_exit(SIGKILL);

-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
out_of_memory:
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
up_read(&mm->mmap_sem);
- if (is_init(current)) {
- yield();
- goto again;
- }
- printk("VM: killing process %s\n", tsk->comm);
- if (error_code & 4)
- do_exit(SIGKILL);
- goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -617,6 +617,11 @@ static inline int page_mapped(struct pag
*/
#define VM_FAULT_WRITE 0x10

+/*
+ * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
+ */
+extern void pagefault_out_of_memory(void);
+
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)

extern void show_free_areas(void);
Index: linux-2.6/arch/um/kernel/trap.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/trap.c
+++ linux-2.6/arch/um/kernel/trap.c
@@ -75,7 +75,6 @@ good_area:
goto out;

do {
-survive:
switch (handle_mm_fault(mm, vma, address, is_write)){
case VM_FAULT_MINOR:
current->min_flt++;
@@ -119,13 +118,13 @@ out_nosemaphore:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- if (is_init(current)) {
- up_read(&mm->mmap_sem);
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- goto out;
+ /*
+ * We ran out of memory, call the OOM killer, and return to userspace
+ * (the fault will be retried if we weren't killed)
+ */
+ up_read(&mm->mmap_sem);
+ pagefault_out_of_memory();
+ goto out_nosemaphore;
}

void segv_handler(int sig, union uml_pt_regs *regs)

2006-10-12 14:10:24

by Nick Piggin

[permalink] [raw]

Subject: [patch 4/5] mm: incorrect VM_FAULT_OOM returns from drivers

Some drivers are returning OOM when it is not in response to a memory
shortage.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/drivers/char/drm/drm_vm.c
===================================================================
--- linux-2.6.orig/drivers/char/drm/drm_vm.c
+++ linux-2.6/drivers/char/drm/drm_vm.c
@@ -147,14 +147,14 @@ static __inline__ struct page *drm_do_vm
if (address > vma->vm_end)
return NOPAGE_SIGBUS; /* Disallow mremap */
if (!map)
- return NOPAGE_OOM; /* Nothing allocated */
+ return NOPAGE_SIGBUS; /* Nothing allocated */

offset = address - vma->vm_start;
i = (unsigned long)map->handle + offset;
page = (map->type == _DRM_CONSISTENT) ?
virt_to_page((void *)i) : vmalloc_to_page((void *)i);
if (!page)
- return NOPAGE_OOM;
+ return NOPAGE_SIGBUS;
get_page(page);

DRM_DEBUG("shm_nopage 0x%lx\n", address);
@@ -272,7 +272,7 @@ static __inline__ struct page *drm_do_vm
if (address > vma->vm_end)
return NOPAGE_SIGBUS; /* Disallow mremap */
if (!dma->pagelist)
- return NOPAGE_OOM; /* Nothing allocated */
+ return NOPAGE_SIGBUS; /* Nothing allocated */

offset = address - vma->vm_start; /* vm_[pg]off[set] should be 0 */
page_nr = offset >> PAGE_SHIFT;
@@ -310,7 +310,7 @@ static __inline__ struct page *drm_do_vm
if (address > vma->vm_end)
return NOPAGE_SIGBUS; /* Disallow mremap */
if (!entry->pagelist)
- return NOPAGE_OOM; /* Nothing allocated */
+ return NOPAGE_SIGBUS; /* Nothing allocated */

offset = address - vma->vm_start;
map_offset = map->offset - (unsigned long)dev->sg->virtual;
Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -3025,7 +3025,7 @@ static struct page * snd_pcm_mmap_status
struct page * page;

if (substream == NULL)
- return NOPAGE_OOM;
+ return NOPAGE_SIGBUS;
runtime = substream->runtime;
page = virt_to_page(runtime->status);
get_page(page);
@@ -3068,7 +3068,7 @@ static struct page * snd_pcm_mmap_contro
struct page * page;

if (substream == NULL)
- return NOPAGE_OOM;
+ return NOPAGE_SIGBUS;
runtime = substream->runtime;
page = virt_to_page(runtime->control);
get_page(page);
@@ -3129,18 +3129,18 @@ static struct page *snd_pcm_mmap_data_no
size_t dma_bytes;

if (substream == NULL)
- return NOPAGE_OOM;
+ return NOPAGE_SIGBUS;
runtime = substream->runtime;
offset = area->vm_pgoff << PAGE_SHIFT;
offset += address - area->vm_start;
- snd_assert((offset % PAGE_SIZE) == 0, return NOPAGE_OOM);
+ snd_assert((offset % PAGE_SIZE) == 0, return NOPAGE_SIGBUS);
dma_bytes = PAGE_ALIGN(runtime->dma_bytes);
if (offset > dma_bytes - PAGE_SIZE)
return NOPAGE_SIGBUS;
if (substream->ops->page) {
page = substream->ops->page(substream, offset);
if (! page)
- return NOPAGE_OOM;
+ return NOPAGE_OOM; /* XXX: is this really due to OOM? */
} else {
vaddr = runtime->dma_area + offset;
page = virt_to_page(vaddr);
Index: linux-2.6/sound/oss/via82cxxx_audio.c
===================================================================
--- linux-2.6.orig/sound/oss/via82cxxx_audio.c
+++ linux-2.6/sound/oss/via82cxxx_audio.c
@@ -2120,8 +2120,8 @@ static struct page * via_mm_nopage (stru
return NOPAGE_SIGBUS; /* Disallow mremap */
}
if (!card) {
- DPRINTK ("EXIT, returning NOPAGE_OOM\n");
- return NOPAGE_OOM; /* Nothing allocated */
+ DPRINTK ("EXIT, returning NOPAGE_SIGBUS\n");
+ return NOPAGE_SIGBUS; /* Nothing allocated */
}

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
Index: linux-2.6/sound/usb/usx2y/usX2Yhwdep.c
===================================================================
--- linux-2.6.orig/sound/usb/usx2y/usX2Yhwdep.c
+++ linux-2.6/sound/usb/usx2y/usX2Yhwdep.c
@@ -48,7 +48,7 @@ static struct page * snd_us428ctls_vm_no

offset = area->vm_pgoff << PAGE_SHIFT;
offset += address - area->vm_start;
- snd_assert((offset % PAGE_SIZE) == 0, return NOPAGE_OOM);
+ snd_assert((offset % PAGE_SIZE) == 0, return NOPAGE_SIGBUS);
vaddr = (char*)((struct usX2Ydev *)area->vm_private_data)->us428ctls_sharedmem + offset;
page = virt_to_page(vaddr);
get_page(page);

2006-10-12 15:06:28

by Kirill Korotaev

[permalink] [raw]

Subject: Re: [patch 5/5] oom: invoke OOM killer from pagefault handler

Nick,

AFAICS, 1 page allocation which is done in page fault handler
can fail in the only case - OOM kills current, so if we failed
we should have TIF_MEMDIE and just kill current.
Selecting another process for killing if page fault fails means
taking another victim with the one being already killed.

my 2 cents.

Thanks,
Kirill

> Rather than have the pagefault handler kill a process directly if it gets a
> VM_FAULT_OOM, have it call into the OOM killer.
>
> Only converted a few architectures so far - this is just an RFC.
>
> Index: linux-2.6/mm/oom_kill.c
> ===================================================================
> --- linux-2.6.orig/mm/oom_kill.c
> +++ linux-2.6/mm/oom_kill.c
> @@ -376,6 +376,57 @@ int unregister_oom_notifier(struct notif
> }
> EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>
> +/*
> + * Must be called with cpuset_lock and tasklist_lock held for read.
> + */
> +void __out_of_memory(void)
> +{
> + unsigned long points = 0;
> + struct task_struct *p;
> +
> + if (sysctl_panic_on_oom)
> + panic("out of memory. panic_on_oom is selected\n");
> +retry:
> + /*
> + * Rambo mode: Shoot down a process and hope it solves whatever
> + * issues we may have.
> + */
> + p = select_bad_process(&points);
> +
> + if (PTR_ERR(p) == -1UL)
> + return;
> +
> + /* Found nothing?!?! Either we hang forever, or we panic. */
> + if (!p) {
> + read_unlock(&tasklist_lock);
> + cpuset_unlock();
> + panic("Out of memory and no killable processes...\n");
> + }
> +
> + if (oom_kill_process(p, points, "Out of memory"))
> + goto retry;
> +}
> +
> +/*
> + * pagefault handler calls into here because it is out of memory but
> + * doesn't know exactly how or why.
> + */
> +void pagefault_out_of_memory(void)
> +{
> + if (printk_ratelimit()) {
> + printk(KERN_WARNING "%s invoked oom-killer from pagefault: "
> + "oomkilladj=%d\n", current->oomkilladj);
> + dump_stack();
> + show_mem();
> + }
> +
> + cpuset_lock();
> + read_lock(&tasklist_lock);
> + __out_of_memory();
> + read_unlock(&tasklist_lock);
> + cpuset_unlock();
> +}
> +
> /**
> * out_of_memory - kill the "best" process when we run out of memory
> *
> @@ -386,8 +437,6 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifie
> */
> void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> {
> - struct task_struct *p;
> - unsigned long points = 0;
> unsigned long freed = 0;
>
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> @@ -412,42 +461,18 @@ void out_of_memory(struct zonelist *zone
> */
> switch (constrained_alloc(zonelist, gfp_mask)) {
> case CONSTRAINT_MEMORY_POLICY:
> - oom_kill_process(current, points,
> - "No available memory (MPOL_BIND)");
> + oom_kill_process(current, 0, "No available memory (MPOL_BIND)");
> break;
>
> case CONSTRAINT_CPUSET:
> - oom_kill_process(current, points,
> - "No available memory in cpuset");
> + oom_kill_process(current, 0, "No available memory in cpuset");
> break;
>
> case CONSTRAINT_NONE:
> - if (sysctl_panic_on_oom)
> - panic("out of memory. panic_on_oom is selected\n");
> -retry:
> - /*
> - * Rambo mode: Shoot down a process and hope it solves whatever
> - * issues we may have.
> - */
> - p = select_bad_process(&points);
> -
> - if (PTR_ERR(p) == -1UL)
> - goto out;
> -
> - /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!p) {
> - read_unlock(&tasklist_lock);
> - cpuset_unlock();
> - panic("Out of memory and no killable processes...\n");
> - }
> -
> - if (oom_kill_process(p, points, "Out of memory"))
> - goto retry;
> -
> + __out_of_memory();
> break;
> }
>
> -out:
> read_unlock(&tasklist_lock);
> cpuset_unlock();
>
> Index: linux-2.6/arch/alpha/mm/fault.c
> ===================================================================
> --- linux-2.6.orig/arch/alpha/mm/fault.c
> +++ linux-2.6/arch/alpha/mm/fault.c
> @@ -143,7 +143,6 @@ do_page_fault(unsigned long address, uns
> goto bad_area;
> }
>
> - survive:
> /* If for any reason at all we couldn't handle the fault,
> make sure we exit gracefully rather than endlessly redo
> the fault. */
> @@ -190,19 +189,13 @@ do_page_fault(unsigned long address, uns
> die_if_kernel("Oops", regs, cause, (unsigned long*)regs - 16);
> do_exit(SIGKILL);
>
> - /* We ran out of memory, or some other thing happened to us that
> - made us unable to handle the page fault gracefully. */
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> out_of_memory:
> - if (is_init(current)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> - printk(KERN_ALERT "VM: killing process %s(%d)\n",
> - current->comm, current->pid);
> - if (!user_mode(regs))
> - goto no_context;
> - do_exit(SIGKILL);
> + pagefault_out_of_memory();
> + return;
>
> do_sigbus:
> /* Send a sigbus, regardless of whether we were in kernel
> Index: linux-2.6/arch/i386/mm/fault.c
> ===================================================================
> --- linux-2.6.orig/arch/i386/mm/fault.c
> +++ linux-2.6/arch/i386/mm/fault.c
> @@ -444,7 +444,6 @@ good_area:
> goto bad_area;
> }
>
> - survive:
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -583,21 +582,14 @@ no_context:
> bust_spinlocks(0);
> do_exit(SIGKILL);
>
> -/*
> - * We ran out of memory, or some other thing happened to us that made
> - * us unable to handle the page fault gracefully.
> - */
> out_of_memory:
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> up_read(&mm->mmap_sem);
> - if (is_init(tsk)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> - printk("VM: killing process %s\n", tsk->comm);
> - if (error_code & 4)
> - do_exit(SIGKILL);
> - goto no_context;
> + pagefault_out_of_memory();
> + return;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> Index: linux-2.6/arch/ia64/mm/fault.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/mm/fault.c
> +++ linux-2.6/arch/ia64/mm/fault.c
> @@ -155,7 +155,6 @@ ia64_do_page_fault (unsigned long addres
> if ((vma->vm_flags & mask) != mask)
> goto bad_area;
>
> - survive:
> /*
> * If for any reason at all we couldn't handle the fault, make
> * sure we exit gracefully rather than endlessly redo the
> @@ -280,13 +279,10 @@ ia64_do_page_fault (unsigned long addres
>
> out_of_memory:
> up_read(&mm->mmap_sem);
> - if (is_init(current)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> - printk(KERN_CRIT "VM: killing process %s\n", current->comm);
> - if (user_mode(regs))
> - do_exit(SIGKILL);
> - goto no_context;
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> + pagefault_out_of_memory();
> + return;
> }
> Index: linux-2.6/arch/powerpc/mm/fault.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/fault.c
> +++ linux-2.6/arch/powerpc/mm/fault.c
> @@ -342,7 +342,6 @@ good_area:
> * make sure we exit gracefully rather than endlessly redo
> * the fault.
> */
> - survive:
> switch (handle_mm_fault(mm, vma, address, is_write)) {
>
> case VM_FAULT_MINOR:
> @@ -380,21 +379,14 @@ bad_area_nosemaphore:
>
> return SIGSEGV;
>
> -/*
> - * We ran out of memory, or some other thing happened to us that made
> - * us unable to handle the page fault gracefully.
> - */
> out_of_memory:
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> up_read(&mm->mmap_sem);
> - if (is_init(current)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> - printk("VM: killing process %s\n", current->comm);
> - if (user_mode(regs))
> - do_exit(SIGKILL);
> - return SIGKILL;
> + pagefault_out_of_memory();
> + return 0;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> Index: linux-2.6/arch/x86_64/mm/fault.c
> ===================================================================
> --- linux-2.6.orig/arch/x86_64/mm/fault.c
> +++ linux-2.6/arch/x86_64/mm/fault.c
> @@ -407,7 +407,6 @@ asmlinkage void __kprobes do_page_fault(
> if (unlikely(in_atomic() || !mm))
> goto bad_area_nosemaphore;
>
> - again:
> /* When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> * kernel and should generate an OOPS. Unfortunatly, in the case of an
> @@ -574,20 +573,14 @@ no_context:
> oops_end(flags);
> do_exit(SIGKILL);
>
> -/*
> - * We ran out of memory, or some other thing happened to us that made
> - * us unable to handle the page fault gracefully.
> - */
> out_of_memory:
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> up_read(&mm->mmap_sem);
> - if (is_init(current)) {
> - yield();
> - goto again;
> - }
> - printk("VM: killing process %s\n", tsk->comm);
> - if (error_code & 4)
> - do_exit(SIGKILL);
> - goto no_context;
> + pagefault_out_of_memory();
> + return;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -617,6 +617,11 @@ static inline int page_mapped(struct pag
> */
> #define VM_FAULT_WRITE 0x10
>
> +/*
> + * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
> + */
> +extern void pagefault_out_of_memory(void);
> +
> #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
>
> extern void show_free_areas(void);
> Index: linux-2.6/arch/um/kernel/trap.c
> ===================================================================
> --- linux-2.6.orig/arch/um/kernel/trap.c
> +++ linux-2.6/arch/um/kernel/trap.c
> @@ -75,7 +75,6 @@ good_area:
> goto out;
>
> do {
> -survive:
> switch (handle_mm_fault(mm, vma, address, is_write)){
> case VM_FAULT_MINOR:
> current->min_flt++;
> @@ -119,13 +118,13 @@ out_nosemaphore:
> * us unable to handle the page fault gracefully.
> */
> out_of_memory:
> - if (is_init(current)) {
> - up_read(&mm->mmap_sem);
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> - goto out;
> + /*
> + * We ran out of memory, call the OOM killer, and return to userspace
> + * (the fault will be retried if we weren't killed)
> + */
> + up_read(&mm->mmap_sem);
> + pagefault_out_of_memory();
> + goto out_nosemaphore;
> }
>
> void segv_handler(int sig, union uml_pt_regs *regs)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-10-12 15:19:17

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 5/5] oom: invoke OOM killer from pagefault handler

On Thu, Oct 12, 2006 at 07:12:13PM +0400, Kirill Korotaev wrote:
> Nick,
>
> AFAICS, 1 page allocation which is done in page fault handler
> can fail in the only case - OOM kills current, so if we failed
> we should have TIF_MEMDIE and just kill current.
> Selecting another process for killing if page fault fails means
> taking another victim with the one being already killed.
>

Hi Kirill,

I don't quite understand you. If the page allocation fails in the
fault handler, we don't want to kill current if it is marked as
OOM_DISABLE or sysctl_panic_on_oom is set... imagine a critical
service in a failover system.

It should be quite likely for another process to be kiled and
provide enough memory to keep the system running. Presuming you
have faith in the concept of the OOM killer ;)

Cheers,
Nick

2006-10-12 22:00:58

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 1/5] oom: don't kill unkillable children or siblings

On Thu, 12 Oct 2006 16:09:43 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Abort the kill if any of our threads have OOM_DISABLE set. Having this test
> here also prevents any OOM_DISABLE child of the "selected" process from being
> killed.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/oom_kill.c
> ===================================================================
> --- linux-2.6.orig/mm/oom_kill.c
> +++ linux-2.6/mm/oom_kill.c
> @@ -312,15 +312,24 @@ static int oom_kill_task(struct task_str
> if (mm == NULL)
> return 1;
>
> + /*
> + * Don't kill the process if any threads are set to OOM_DISABLE
> + */
> + do_each_thread(g, q) {
> + if (q->mm == mm && p->oomkilladj == OOM_DISABLE)
> + return 1;
> + } while_each_thread(g, q);
> +
> __oom_kill_task(p, message);
> +
> /*
> * kill all processes that share the ->mm (i.e. all threads),
> * but are in a different thread group
> */
> - do_each_thread(g, q)
> + do_each_thread(g, q) {
> if (q->mm == mm && q->tgid != p->tgid)
> __oom_kill_task(q, message);
> - while_each_thread(g, q);
> + } while_each_thread(g, q);
>
> return 0;

One wonders whether OOM_DISABLE should be a property of the mm_struct, not
of the task_struct.

2006-10-12 22:04:00

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 3/5] oom: less memdie

On Thu, 12 Oct 2006 16:10:01 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Don't cause all threads in all other thread groups to gain TIF_MEMDIE
> otherwise we'll get a thundering herd eating out memory reserve. This
> may not be the optimal scheme, but it fits our policy of allowing just
> one TIF_MEMDIE in the system at once.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/oom_kill.c
> ===================================================================
> --- linux-2.6.orig/mm/oom_kill.c
> +++ linux-2.6/mm/oom_kill.c
> @@ -322,11 +322,12 @@ static int oom_kill_task(struct task_str
>
> /*
> * kill all processes that share the ->mm (i.e. all threads),
> - * but are in a different thread group.
> + * but are in a different thread group. Don't let them have access
> + * to memory reserves though, otherwise we might deplete all memory.
> */
> do_each_thread(g, q) {
> if (q->mm == mm && q->tgid != p->tgid)
> - __oom_kill_task(q, 1);
> + force_sig(SIGKILL, p);
> } while_each_thread(g, q);
>

Curious. How much testing did you do of this stuff? I assume there were
some observed problems. What were they, and what was the observed effect
of these changes?

Thanks.

2006-10-12 22:10:24

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 5/5] oom: invoke OOM killer from pagefault handler

On Thu, 12 Oct 2006 17:19:07 +0200
Nick Piggin <[email protected]> wrote:

> On Thu, Oct 12, 2006 at 07:12:13PM +0400, Kirill Korotaev wrote:
> > Nick,
> >
> > AFAICS, 1 page allocation which is done in page fault handler
> > can fail in the only case - OOM kills current, so if we failed
> > we should have TIF_MEMDIE and just kill current.
> > Selecting another process for killing if page fault fails means
> > taking another victim with the one being already killed.
> >
>
> Hi Kirill,
>
> I don't quite understand you.

Kirill is claiming that the only occasion on which a pagefault handler would
get an oom is when it killed itself in the oom handler.

> If the page allocation fails in the
> fault handler, we don't want to kill current if it is marked as
> OOM_DISABLE or sysctl_panic_on_oom is set... imagine a critical
> service in a failover system.
>
> It should be quite likely for another process to be kiled and
> provide enough memory to keep the system running. Presuming you
> have faith in the concept of the OOM killer ;)

I'm a bit wobbly about this one. Some before-and-after testing results
would help things along..

2006-10-13 06:32:06

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 1/5] oom: don't kill unkillable children or siblings

Andrew Morton wrote:

>On Thu, 12 Oct 2006 16:09:43 +0200 (CEST)
>Nick Piggin <[email protected]> wrote:
>
>
>>Abort the kill if any of our threads have OOM_DISABLE set. Having this test
>>here also prevents any OOM_DISABLE child of the "selected" process from being
>>killed.
>>
>>Signed-off-by: Nick Piggin <[email protected]>
>>
>>Index: linux-2.6/mm/oom_kill.c
>>===================================================================
>>--- linux-2.6.orig/mm/oom_kill.c
>>+++ linux-2.6/mm/oom_kill.c
>>@@ -312,15 +312,24 @@ static int oom_kill_task(struct task_str
>> if (mm == NULL)
>> return 1;
>>
>>+ /*
>>+ * Don't kill the process if any threads are set to OOM_DISABLE
>>+ */
>>+ do_each_thread(g, q) {
>>+ if (q->mm == mm && p->oomkilladj == OOM_DISABLE)
>>+ return 1;
>>+ } while_each_thread(g, q);
>>+
>> __oom_kill_task(p, message);
>>+
>> /*
>> * kill all processes that share the ->mm (i.e. all threads),
>> * but are in a different thread group
>> */
>>- do_each_thread(g, q)
>>+ do_each_thread(g, q) {
>> if (q->mm == mm && q->tgid != p->tgid)
>> __oom_kill_task(q, message);
>>- while_each_thread(g, q);
>>+ } while_each_thread(g, q);
>>
>> return 0;
>>
>
>One wonders whether OOM_DISABLE should be a property of the mm_struct, not
>of the task_struct.
>

Hmm... I don't think I could argue with that. I think this patch is needed
in the meantime though.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-13 06:38:27

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 3/5] oom: less memdie

Andrew Morton wrote:

>On Thu, 12 Oct 2006 16:10:01 +0200 (CEST)
>Nick Piggin <[email protected]> wrote:
>
>
>>Don't cause all threads in all other thread groups to gain TIF_MEMDIE
>>otherwise we'll get a thundering herd eating out memory reserve. This
>>may not be the optimal scheme, but it fits our policy of allowing just
>>one TIF_MEMDIE in the system at once.
>>
>>Signed-off-by: Nick Piggin <[email protected]>
>>
>>Index: linux-2.6/mm/oom_kill.c
>>===================================================================
>>--- linux-2.6.orig/mm/oom_kill.c
>>+++ linux-2.6/mm/oom_kill.c
>>@@ -322,11 +322,12 @@ static int oom_kill_task(struct task_str
>>
>> /*
>> * kill all processes that share the ->mm (i.e. all threads),
>>- * but are in a different thread group.
>>+ * but are in a different thread group. Don't let them have access
>>+ * to memory reserves though, otherwise we might deplete all memory.
>> */
>> do_each_thread(g, q) {
>> if (q->mm == mm && q->tgid != p->tgid)
>>- __oom_kill_task(q, 1);
>>+ force_sig(SIGKILL, p);
>> } while_each_thread(g, q);
>>
>>
>
>Curious. How much testing did you do of this stuff? I assume there were
>some observed problems. What were they, and what was the observed effect
>of these changes?
>

This change I actually didn't really test because I don't have any apps
to speak
of which use multiple thread groups.

I stumbled on it by inspection when trying to fix the killing of
OOM_DISABLE tasks.
Basically -- we don't set TIF_MEMDIE or boost the priority of *any*
other thread in
our same group, so we shouldn't do it for *all* other threds of all
other groups.

Consider an OOM situation, where there will likely be a lot of threads
stuck in
__alloc_pages. If a large number of these suddenly get a big timeslice
and full
access to memory reserves, they'll eat into more than we'd like.

Now I'm not sure that our current OOM killing / memory reserving scheme is
perfect -- indeed if we only allow a single TIF_MEMDIE thread at once,
we can
get deadlocks. However, that's the direction we've chosen, and it seems
to work
reasonably well. Mostly.

This is just an enforcement of that policy rather than a change in
direction.
Criticism is always welcome though.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-13 06:45:54

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 5/5] oom: invoke OOM killer from pagefault handler

Andrew Morton wrote:

>On Thu, 12 Oct 2006 17:19:07 +0200
>Nick Piggin <[email protected]> wrote:
>
>
>>On Thu, Oct 12, 2006 at 07:12:13PM +0400, Kirill Korotaev wrote:
>>
>>>Nick,
>>>
>>>AFAICS, 1 page allocation which is done in page fault handler
>>>can fail in the only case - OOM kills current, so if we failed
>>>we should have TIF_MEMDIE and just kill current.
>>>Selecting another process for killing if page fault fails means
>>>taking another victim with the one being already killed.
>>>
>>>
>>Hi Kirill,
>>
>>I don't quite understand you.
>>
>
>Kirill is claiming that the only occasion on which a pagefault handler would
>get an oom is when it killed itself in the oom handler.
>

Well I don't think that should happen much. When the process gets OOM
killed,
it is given full access to all memory reserves, so it will be _less_ likely
to go OOM maybe.

Actually if you work it through, maybe that isn't the case -- our
infinite retry
logic in the allocator means that non OOM killed tasks will never return
NULL,
while the OOM task might just use up every single free page in the
system and
will eventually return NULL. In this case the system is probably on
death's door
though, so I don't know if it is worth worrying about.

>>If the page allocation fails in the
>>fault handler, we don't want to kill current if it is marked as
>>OOM_DISABLE or sysctl_panic_on_oom is set... imagine a critical
>>service in a failover system.
>>
>>It should be quite likely for another process to be kiled and
>>provide enough memory to keep the system running. Presuming you
>>have faith in the concept of the OOM killer ;)
>>
>
>I'm a bit wobbly about this one. Some before-and-after testing results
>would help things along..
>

I can force VM_FAULT_OOMs to happen, but it is difficult to make it
happen in
the real world because most fault handling paths don't allocate higher order
allocations.

What I especially have in mind here is the OOM_DISABLE and panic_on_oom
sysctl
rather than expecting particularly much better general oom killing
behaviour.
Suppose you have a critical failover node or heartbeat process or something
where you'd rather the system to panic and reboot instead of doing something
silly...

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-13 06:47:53

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 5/5] oom: invoke OOM killer from pagefault handler

Nick Piggin wrote:

> What I especially have in mind here is the OOM_DISABLE and
> panic_on_oom sysctl
> rather than expecting particularly much better general oom killing
> behaviour.
> Suppose you have a critical failover node or heartbeat process or
> something
> where you'd rather the system to panic and reboot instead of doing
> something
> silly...

Oh, I already said that.

Well anyway, I'm not sure exactly how people use these tunables, but I
expect
those that do, _really_ want them to work.

--

Send instant messages to your online friends http://au.messenger.yahoo.com