2010-06-03 05:48:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: [mmotm 0521][PATCH 0/12] various OOM fixes for 2.6.35

Hi

This patch series is collection of various OOM bugfixes. I think
all of patches can send to 2.6.35.
Recently, David Rientjes and Luis Claudio R. Goncalves posted other
various imporovement. I'll collect such 2.6.36 items and I plan to
push -mm at next week.

patch lists
-------------------------------------
oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
oom: introduce find_lock_task_mm() to fix !mm false positives
oom: the points calculation of child processes must use find_lock_task_mm() too
oom: __oom_kill_task() must use find_lock_task_mm() too
oom: make oom_unkillable() helper function
oom: remove warning for in mm-less task __oom_kill_process()
oom: Fix child process iteration properly
oom: dump_tasks() use find_lock_task_mm() too
oom: remove PF_EXITING check completely
oom: sacrifice child with highest badness score for parent
oom: remove special handling for pagefault ooms
oom: give current access to memory reserves if it has been killed

diffstat
------------
mm/oom_kill.c | 303 ++++++++++++++++++++++++++++++--------------------------
1 files changed, 162 insertions(+), 141 deletions(-)



Changes since last post
-------------------------
- Drop Luis's "give the dying task a higher priority" patch
- Add "remove PF_EXITING check completely" patch
- Drop Oleg's "oom: select_bad_process: PF_EXITING check should
take ->mm into account" because conflict against "remove
PF_EXITING check completely"
- Add "oom: sacrifice child with highest badness score for parent"
- Add "oom: remove special handling for pagefault ooms"
- Add "oom: give current access to memory reserves if it has been killed"




2010-06-03 05:49:53

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 01/12] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads

From: Oleg Nesterov <[email protected]>

select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
is not true due to use_mm().

Change the code to check PF_KTHREAD.

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 709aedf..070b713 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
for_each_process(p) {
unsigned long points;

- /*
- * skip kernel threads and tasks which have already released
- * their mm.
- */
+ /* skip the tasks which have already released their mm. */
if (!p->mm)
continue;
- /* skip the init task */
- if (is_global_init(p))
+ /* skip the init task and kthreads */
+ if (is_global_init(p) || (p->flags & PF_KTHREAD))
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
--
1.6.5.2


2010-06-03 05:50:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 02/12] oom: introduce find_lock_task_mm() to fix !mm false positives

From: Oleg Nesterov <[email protected]>

Almost all ->mm == NUL checks in oom_kill.c are wrong.

The current code assumes that the task without ->mm has already
released its memory and ignores the process. However this is not
necessarily true when this process is multithreaded, other live
sub-threads can use this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
thread which uses the memory and takes task_lock() to pin ->mm

- change oom_badness() to use this helper instead of just checking
->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
task without ->mm, but no matter what badness() returns the
task can be chosen if nothing else has been found yet.

Note! This patch is not enough, we need more changes.

- badness() was fixed, but oom_kill_task() still ignores
the task without ->mm

This will be addressed later.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 070b713..ce9c744 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,6 +52,19 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
return 0;
}

+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ do {
+ task_lock(t);
+ if (likely(t->mm))
+ return t;
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return NULL;
+}
+
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
@@ -74,7 +87,6 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
unsigned long badness(struct task_struct *p, unsigned long uptime)
{
unsigned long points, cpu_time, run_time;
- struct mm_struct *mm;
struct task_struct *child;
int oom_adj = p->signal->oom_adj;
struct task_cputime task_time;
@@ -84,17 +96,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
if (oom_adj == OOM_DISABLE)
return 0;

- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
+ p = find_lock_task_mm(p);
+ if (!p)
return 0;
- }

/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = p->mm->total_vm;

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +126,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
- if (child->mm != mm && child->mm)
+ if (child->mm != p->mm && child->mm)
points += child->mm->total_vm/2 + 1;
task_unlock(child);
}
@@ -256,9 +265,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
for_each_process(p) {
unsigned long points;

- /* skip the tasks which have already released their mm. */
- if (!p->mm)
- continue;
/* skip the init task and kthreads */
if (is_global_init(p) || (p->flags & PF_KTHREAD))
continue;
--
1.6.5.2


2010-06-03 05:51:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 03/12] oom: the points calculation of child processes must use find_lock_task_mm() too

child point calclation use find_lock_task_mm() too.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
---
mm/oom_kill.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ce9c744..0542232 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -87,6 +87,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
unsigned long badness(struct task_struct *p, unsigned long uptime)
{
unsigned long points, cpu_time, run_time;
+ struct task_struct *c;
struct task_struct *child;
int oom_adj = p->signal->oom_adj;
struct task_cputime task_time;
@@ -124,11 +125,13 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
* child is eating the vast majority of memory, adding only half
* to the parents will make the child our kill candidate of choice.
*/
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != p->mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
+ list_for_each_entry(c, &p->children, sibling) {
+ child = find_lock_task_mm(c);
+ if (child) {
+ if (child->mm != p->mm)
+ points += child->mm->total_vm/2 + 1;
+ task_unlock(child);
+ }
}

/*
--
1.6.5.2


2010-06-03 05:52:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 04/12] oom: __oom_kill_task() must use find_lock_task_mm() too


__oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
is true, __out_of_memory() don't call select_bad_process().

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/oom_kill.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0542232..79845f4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -394,12 +394,11 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
return;
}

- task_lock(p);
- if (!p->mm) {
+ p = find_lock_task_mm(p);
+ if (!p) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
task_pid_nr(p), p->comm);
- task_unlock(p);
return;
}

--
1.6.5.2


2010-06-03 05:53:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 05/12] oom: make oom_unkillable() helper function

Now, sysctl_oom_kill_allocating_task case and CONSTRAINT_MEMORY_POLICY
case don't call select_bad_process(). then, oom_kill_process() need
very similar unkillable tasks.

This patch unify it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 62 ++++++++++++++++++++++++--------------------------------
1 files changed, 27 insertions(+), 35 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79845f4..618cf44 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -250,6 +250,21 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
}
#endif

+static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
+{
+ /* skip the init task and kthreads */
+ if (is_global_init(p) || (p->flags & PF_KTHREAD))
+ return 1;
+
+ if (mem && !task_in_mem_cgroup(p, mem))
+ return 1;
+
+ if (p->signal->oom_adj == OOM_DISABLE)
+ return 1;
+
+ return 0;
+}
+
/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
@@ -268,10 +283,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
for_each_process(p) {
unsigned long points;

- /* skip the init task and kthreads */
- if (is_global_init(p) || (p->flags & PF_KTHREAD))
- continue;
- if (mem && !task_in_mem_cgroup(p, mem))
+ if (oom_unkillable(p, mem))
continue;

/*
@@ -304,9 +316,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*ppoints = ULONG_MAX;
}

- if (p->signal->oom_adj == OOM_DISABLE)
- continue;
-
points = badness(p, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
@@ -386,20 +395,18 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
* flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
* set.
*/
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
+ int verbose)
{
- if (is_global_init(p)) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill init!\n");
- return;
- }
+ if (oom_unkillable(p, mem))
+ return 1;

p = find_lock_task_mm(p);
if (!p) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
task_pid_nr(p), p->comm);
- return;
+ return 1;
}

if (verbose)
@@ -420,22 +427,6 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
set_tsk_thread_flag(p, TIF_MEMDIE);

force_sig(SIGKILL, p);
-}
-
-static int oom_kill_task(struct task_struct *p)
-{
- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
- *
- * Furthermore, even if mm contains a non-NULL value, p->mm may
- * change to NULL at any time since we do not hold task_lock(p).
- * However, this is of no concern to us.
- */
- if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
- return 1;
-
- __oom_kill_task(p, 1);

return 0;
}
@@ -454,7 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p, 0);
+ __oom_kill_process(p, mem, 0);
return 0;
}

@@ -465,12 +456,13 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
list_for_each_entry(c, &p->children, sibling) {
if (c->mm == p->mm)
continue;
- if (mem && !task_in_mem_cgroup(c, mem))
- continue;
- if (!oom_kill_task(c))
+
+ /* Ok, Kill the child */
+ if (!__oom_kill_process(c, mem, 1))
return 0;
}
- return oom_kill_task(p);
+
+ return __oom_kill_process(p, mem, 1);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
--
1.6.5.2


2010-06-03 06:11:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [mmotm 0521][PATCH 0/12] various OOM fixes for 2.6.35

On Thu, Jun 3, 2010 at 2:48 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi
>
> This patch series is collection of various OOM bugfixes. I think
> all of patches can send to 2.6.35.
> Recently, David Rientjes and Luis Claudio R. Goncalves posted other
> various imporovement. I'll collect such 2.6.36 items and I plan to
> push -mm at next week.

Recently, there are many confusion due to patches related OOM.
You are cleaning up it and looks better than old for review.
Thanks for your great effort.


--
Kind regards,
Minchan Kim

2010-06-03 06:12:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 02/12] oom: introduce find_lock_task_mm() to fix !mm false positives

On Thu, Jun 3, 2010 at 2:50 PM, KOSAKI Motohiro
<[email protected]> wrote:
> From: Oleg Nesterov <[email protected]>
>
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
>
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
>
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>  just wrong.
>
> - Add the new helper, find_lock_task_mm(), which finds the live
>  thread which uses the memory and takes task_lock() to pin ->mm
>
> - change oom_badness() to use this helper instead of just checking
>  ->mm != NULL.
>
> - As David pointed out, select_bad_process() must never choose the
>  task without ->mm, but no matter what badness() returns the
>  task can be chosen if nothing else has been found yet.
>
> Note! This patch is not enough, we need more changes.
>
>        - badness() was fixed, but oom_kill_task() still ignores
>          the task without ->mm
>
> This will be addressed later.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> Cc: David Rientjes <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Could you see my previous comment?
http://lkml.org/lkml/2010/6/2/325
Anyway, I added my review sign

Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim

2010-06-03 06:20:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 03/12] oom: the points calculation of child processes must use find_lock_task_mm() too

On Thu, Jun 3, 2010 at 2:51 PM, KOSAKI Motohiro
<[email protected]> wrote:
> child point calclation use find_lock_task_mm() too.

Sorry but I have to hurt you, again.
Although we guess what you want, For future, please, write down
description more clearly.

>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Acked-by: Oleg Nesterov <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-06-03 06:23:13

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 06/12] oom: remove warning for in mm-less task __oom_kill_process()

If the race of mm detach in task exiting vs oom is happen,
find_lock_task_mm() can be return NULL.

So, the warning is pointless.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 6 +-----
1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 618cf44..d4484c5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -402,12 +402,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
return 1;

p = find_lock_task_mm(p);
- if (!p) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
- task_pid_nr(p), p->comm);
+ if (!p)
return 1;
- }

if (verbose)
printk(KERN_ERR "Killed process %d (%s) "
--
1.6.5.2


2010-06-03 06:23:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 07/12] oom: Fix child process iteration properly

Oleg pointed out that current oom child process iterating logic is wrong.

> list_for_each_entry(p->children) can only see the tasks forked
> by p, it can't see other children forked by its sub-threads.

This patch fixes it.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 35 +++++++++++++++++++++--------------
1 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d4484c5..35a2ecc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -88,6 +88,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
{
unsigned long points, cpu_time, run_time;
struct task_struct *c;
+ struct task_struct *t;
struct task_struct *child;
int oom_adj = p->signal->oom_adj;
struct task_cputime task_time;
@@ -125,14 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
* child is eating the vast majority of memory, adding only half
* to the parents will make the child our kill candidate of choice.
*/
- list_for_each_entry(c, &p->children, sibling) {
- child = find_lock_task_mm(c);
- if (child) {
- if (child->mm != p->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
+ t = p;
+ do {
+ list_for_each_entry(c, &t->children, sibling) {
+ child = find_lock_task_mm(c);
+ if (child) {
+ if (child->mm != p->mm)
+ points += child->mm->total_vm/2 + 1;
+ task_unlock(child);
+ }
}
- }
+ } while_each_thread(p, t);

/*
* CPU time is in tens of seconds and run time is in thousands
@@ -432,6 +436,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
const char *message)
{
struct task_struct *c;
+ struct task_struct *t = p;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -449,14 +454,16 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
message, task_pid_nr(p), p->comm, points);

/* Try to kill a child first */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == p->mm)
- continue;
+ do {
+ list_for_each_entry(c, &t->children, sibling) {
+ if (c->mm == p->mm)
+ continue;

- /* Ok, Kill the child */
- if (!__oom_kill_process(c, mem, 1))
- return 0;
- }
+ /* Ok, Kill the child */
+ if (!__oom_kill_process(c, mem, 1))
+ return 0;
+ }
+ } while_each_thread(p, t);

return __oom_kill_process(p, mem, 1);
}
--
1.6.5.2


2010-06-03 06:24:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

dump_task() should use find_lock_task_mm() too. It is necessary for
protecting task-exiting race.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 35a2ecc..6360c56 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -345,35 +345,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*/
static void dump_tasks(const struct mem_cgroup *mem)
{
- struct task_struct *g, *p;
+ struct task_struct *p;
+ struct task_struct *task;

printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
"name\n");
- do_each_thread(g, p) {
- struct mm_struct *mm;

- if (mem && !task_in_mem_cgroup(p, mem))
+ for_each_process(p) {
+ /*
+ * We don't have is_global_init() check here, because the old
+ * code do that. printing init process is not big matter. But
+ * we don't hope to make unnecessary compatiblity breaking.
+ */
+ if (p->flags & PF_KTHREAD)
continue;
- if (!thread_group_leader(p))
+ if (mem && !task_in_mem_cgroup(p, mem))
continue;

- task_lock(p);
- mm = p->mm;
- if (!mm) {
+ task = find_lock_task_mm(p);
+ if (!task)
/*
- * total_vm and rss sizes do not exist for tasks with no
- * mm so there's no need to report them; they can't be
- * oom killed anyway.
+ * Probably oom vs task-exiting race was happen and ->mm
+ * have been detached. thus there's no need to report them;
+ * they can't be oom killed anyway.
*/
- task_unlock(p);
continue;
- }
+
printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n",
- p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
- get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
+ task->pid, __task_cred(task)->uid, task->tgid, task->mm->total_vm,
+ get_mm_rss(task->mm), (int)task_cpu(task), task->signal->oom_adj,
p->comm);
- task_unlock(p);
- } while_each_thread(g, p);
+ task_unlock(task);
+ }
}

static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
--
1.6.5.2


2010-06-03 06:25:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 09/12] oom: remove PF_EXITING check completely

Currently, PF_EXITING check is completely broken. because 1) It only
care main-thread and ignore sub-threads 2) If user enable core-dump
feature, it can makes deadlock because the task during coredump ignore
SIGKILL.

The deadlock is certenaly worst result, then, minor PF_EXITING
optimization worth is relatively ignorable.

This patch removes it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
---
mm/oom_kill.c | 27 ---------------------------
1 files changed, 0 insertions(+), 27 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6360c56..5d723fb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
if (test_tsk_thread_flag(p, TIF_MEMDIE))
return ERR_PTR(-1UL);

- /*
- * This is in the process of releasing memory so wait for it
- * to finish before killing some other task by mistake.
- *
- * However, if p is the current task, we allow the 'kill' to
- * go ahead if it is exiting: this will simply set TIF_MEMDIE,
- * which will allow it to gain access to memory reserves in
- * the process of exiting and releasing its resources.
- * Otherwise we could get an easy OOM deadlock.
- */
- if (p->flags & PF_EXITING) {
- if (p != current)
- return ERR_PTR(-1UL);
-
- chosen = p;
- *ppoints = ULONG_MAX;
- }
-
points = badness(p, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
@@ -444,15 +426,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);

- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just set TIF_MEMDIE so it can die quickly
- */
- if (p->flags & PF_EXITING) {
- __oom_kill_process(p, mem, 0);
- return 0;
- }
-
printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
message, task_pid_nr(p), p->comm, points);

--
1.6.5.2


2010-06-03 06:26:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 10/12] oom: sacrifice child with highest badness score for parent

From: David Rientjes <[email protected]>

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based
on the ordering of the selected task's child list. Additionally, it
is not guaranteed at all to free a large amount of memory that we need
to prevent additional oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing
its parent's memory, if one exists. The worst child is indicated with
the highest badness() score. This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent. Note that the only time
__oom_kill_process() fails, however, is when a child does not have an
mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
cases, so the final __oom_kill_process() will always succeed.

Acked-by: Rik van Riel <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 23 ++++++++++++++++-------
1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5d723fb..e4c6141 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -422,26 +422,35 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
{
struct task_struct *c;
struct task_struct *t = p;
+ struct task_struct *victim = p;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);

- printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
- message, task_pid_nr(p), p->comm, points);
+ pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);

- /* Try to kill a child first */
+ do_posix_clock_monotonic_gettime(&uptime);
+ /* Try to sacrifice the worst child first */
do {
list_for_each_entry(c, &t->children, sibling) {
+ unsigned long cpoints;
+
if (c->mm == p->mm)
continue;

- /* Ok, Kill the child */
- if (!__oom_kill_process(c, mem, 1))
- return 0;
+ /* badness() returns 0 if the thread is unkillable */
+ cpoints = badness(c, uptime.tv_sec);
+ if (cpoints > victim_points) {
+ victim = c;
+ victim_points = cpoints;
+ }
}
} while_each_thread(p, t);

- return __oom_kill_process(p, mem, 1);
+ return __oom_kill_process(victim, mem, 1);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
--
1.6.5.2


2010-06-03 06:26:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 11/12] oom: remove special handling for pagefault ooms

From: David Rientjes <[email protected]>

It is possible to remove the special pagefault oom handler by simply oom
locking all system zones and then calling directly into out_of_memory().

All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
parallel oom killing in progress that will lead to eventual memory
freeing so it's not necessary to needlessly kill another task. The context
in which the pagefault is allocating memory is unknown to the oom killer,
so this is done on a system-wide level.

If a task has already been oom killed and hasn't fully exited yet, this
will be a no-op since select_bad_process() recognizes tasks across the
system with TIF_MEMDIE set.

Acked-by: Nick Piggin <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 86 +++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 56 insertions(+), 30 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4c6141..67b5fa5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -540,6 +540,43 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
}

/*
+ * Try to acquire the oom killer lock for all system zones. Returns zero if a
+ * parallel oom killing is taking place, otherwise locks all zones and returns
+ * non-zero.
+ */
+static int try_set_system_oom(void)
+{
+ struct zone *zone;
+ int ret = 1;
+
+ spin_lock(&zone_scan_lock);
+ for_each_populated_zone(zone)
+ if (zone_is_oom_locked(zone)) {
+ ret = 0;
+ goto out;
+ }
+ for_each_populated_zone(zone)
+ zone_set_flag(zone, ZONE_OOM_LOCKED);
+out:
+ spin_unlock(&zone_scan_lock);
+ return ret;
+}
+
+/*
+ * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
+ * attempts or page faults may now recall the oom killer, if necessary.
+ */
+static void clear_system_oom(void)
+{
+ struct zone *zone;
+
+ spin_lock(&zone_scan_lock);
+ for_each_populated_zone(zone)
+ zone_clear_flag(zone, ZONE_OOM_LOCKED);
+ spin_unlock(&zone_scan_lock);
+}
+
+/*
* Must be called with tasklist_lock held for read.
*/
static void __out_of_memory(gfp_t gfp_mask, int order)
@@ -573,34 +610,6 @@ retry:
goto retry;
}

-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
- unsigned long freed = 0;
-
- blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
- if (freed > 0)
- /* Got some memory back in the last second. */
- return;
-
- if (sysctl_panic_on_oom)
- panic("out of memory from page fault. panic_on_oom is selected.\n");
-
- read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
- read_unlock(&tasklist_lock);
-
- /*
- * Give "p" a good chance of killing itself before we
- * retry to allocate memory.
- */
- if (!test_thread_flag(TIF_MEMDIE))
- schedule_timeout_uninterruptible(1);
-}
-
/**
* out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
@@ -616,7 +625,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
- enum oom_constraint constraint;
+ enum oom_constraint constraint = CONSTRAINT_NONE;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
@@ -632,7 +641,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ if (zonelist)
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
@@ -661,3 +671,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_uninterruptible(1);
}
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing. If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+ if (try_set_system_oom()) {
+ out_of_memory(NULL, 0, 0, NULL);
+ clear_system_oom();
+ }
+ if (!test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
+}
--
1.6.5.2


2010-06-03 06:27:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 12/12] oom: give current access to memory reserves if it has been killed

From: David Rientjes <[email protected]>

It's possible to livelock the page allocator if a thread has
mm->mmap_sem and fails to make forward progress because the
oom killer selects another thread sharing the same ->mm to
kill that cannot exit until the semaphore is dropped.

The oom killer will not kill multiple tasks at the same time; each oom
killed task must exit before another task may be killed. Thus, if one
thread is holding mm->mmap_sem and cannot allocate memory, all threads
sharing the same ->mm are blocked from exiting as well. In the oom kill
case, that means the thread holding mm->mmap_sem will never free
additional memory since it cannot get access to memory reserves and the
thread that depends on it with access to memory reserves cannot exit
because it cannot acquire the semaphore. Thus, the page allocators
livelocks.

When the oom killer is called and current happens to have a pending
SIGKILL, this patch automatically gives it access to memory reserves and
returns. Upon returning to the page allocator, its allocation will
hopefully succeed so it can quickly exit and free its memory. If not,
the page allocator will fail the allocation if it is not __GFP_NOFAIL.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 67b5fa5..ad85e1b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -638,6 +638,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
}

/*
+ * If current has a pending SIGKILL, then automatically select it. The
+ * goal is to allow it to allocate so that it may quickly exit and free
+ * its memory.
+ */
+ if (fatal_signal_pending(current)) {
+ set_tsk_thread_flag(current, TIF_MEMDIE);
+ return;
+ }
+
+ /*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
--
1.6.5.2


2010-06-03 06:35:01

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> Currently, PF_EXITING check is completely broken. because 1) It only
> care main-thread and ignore sub-threads

Then check the subthreads.

> 2) If user enable core-dump
> feature, it can makes deadlock because the task during coredump ignore
> SIGKILL.
>

It may ignore SIGKILL, but does not ignore fatal_signal_pending() being
true which gives it access to memory reserves with my patchset so that it
may quickly finish.

> The deadlock is certenaly worst result, then, minor PF_EXITING
> optimization worth is relatively ignorable.
>
> This patch removes it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Acked-by: Oleg Nesterov <[email protected]>

Nacked-by: David Rientjes <[email protected]>

You have no real world experience in using the oom killer for memory
containment and don't understand how critical it is to protect other
vital system tasks that are needlessly killed as the result of this patch.

2010-06-03 06:35:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 06/12] oom: remove warning for in mm-less task __oom_kill_process()

On Thu, 3 Jun 2010 15:23:03 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> If the race of mm detach in task exiting vs oom is happen,
> find_lock_task_mm() can be return NULL.
>
> So, the warning is pointless.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-06-03 06:37:39

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 06/12] oom: remove warning for in mm-less task __oom_kill_process()

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> If the race of mm detach in task exiting vs oom is happen,
> find_lock_task_mm() can be return NULL.
>
> So, the warning is pointless.
>

Oh, please. This isn't rc material.

I already remove this entire function in my patchset, why are you even
bothering?

2010-06-03 06:37:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 07/12] oom: Fix child process iteration properly

On Thu, 3 Jun 2010 15:23:50 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Oleg pointed out that current oom child process iterating logic is wrong.
>
> > list_for_each_entry(p->children) can only see the tasks forked
> > by p, it can't see other children forked by its sub-threads.
>
> This patch fixes it.
>
> Reported-by: Oleg Nesterov <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-06-03 06:38:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

On Thu, 3 Jun 2010 15:24:36 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> dump_task() should use find_lock_task_mm() too. It is necessary for
> protecting task-exiting race.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-06-03 06:40:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On Thu, 3 Jun 2010 15:25:18 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Currently, PF_EXITING check is completely broken. because 1) It only
> care main-thread and ignore sub-threads 2) If user enable core-dump
> feature, it can makes deadlock because the task during coredump ignore
> SIGKILL.
>
> The deadlock is certenaly worst result, then, minor PF_EXITING
> optimization worth is relatively ignorable.
>
> This patch removes it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Acked-by: Oleg Nesterov <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-06-03 06:52:50

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 02/12] oom: introduce find_lock_task_mm() to fix !mm false positives

> > Signed-off-by: Oleg Nesterov <[email protected]>
> > Cc: David Rientjes <[email protected]>
> > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> Could you see my previous comment?
> http://lkml.org/lkml/2010/6/2/325
> Anyway, I added my review sign
>
> Reviewed-by: Minchan Kim <[email protected]>

Sorry, I had lost your comment ;)

But personally I don't like find_alive_subthread() because
such function actually does,
1) iterate threads in the same thread group
2) find alive (a.k.a have ->mm) thread
3) lock the task
and, I think (3) is most important role of this function.
So, I prefer to contain "lock" word.

Otherwise, people easily forget to cann task_unlock().
But I'm ok to rename any give me better name.

Thanks.

2010-06-03 14:02:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On 06/02, David Rientjes wrote:
>
> On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:
>
> > Currently, PF_EXITING check is completely broken. because 1) It only
> > care main-thread and ignore sub-threads
>
> Then check the subthreads.
>
> > 2) If user enable core-dump
> > feature, it can makes deadlock because the task during coredump ignore
> > SIGKILL.
> >
>
> It may ignore SIGKILL, but does not ignore fatal_signal_pending() being
> true

Wrong.

Unless the oom victim is exactly the thread which dumps the core,
fatal_signal_pending() won't be true for the dumper. Even if the
victim and the dumper are from the same group, this thread group
already has SIGNAL_GROUP_EXIT. And if they do not belong to the
same group, SIGKILL has even less effect.

Even if we chose the right thread we can race with
clear_thread_flag(TIF_SIGPENDING), but fatal_signal_pending()
checks signal_pending().

> which gives it access to memory reserves with my patchset

__get_user_pages() already checks fatal_signal_pending(), this
is where the dumper allocates the memory (mostly).

And I am not sure I understand the "access to memory reserves",
the dumper should just stop if oom-kill decides it should die,
it can use a lot more memory if it doesn't stop.

> Nacked-by: David Rientjes <[email protected]>

Kosaki removes the code which only pretends to work, but it doesn't
and leads to problems.

If you think we need this check, imho it is better to make the patch
which adds the "right" code with the nice changelog explaining how
this code works.


Just my opinion, I know very little about oom logic/needs/problems,
you can ignore me.

Oleg.

2010-06-03 15:23:19

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

On 06/03, KOSAKI Motohiro wrote:
>
> dump_task() should use find_lock_task_mm() too. It is necessary for
> protecting task-exiting race.

This patch also replaces the pointless do_each_thread() with
for_each_process(), good.

Reviewed-by: Oleg Nesterov <[email protected]>

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/oom_kill.c | 37 ++++++++++++++++++++-----------------
> 1 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 35a2ecc..6360c56 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -345,35 +345,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> */
> static void dump_tasks(const struct mem_cgroup *mem)
> {
> - struct task_struct *g, *p;
> + struct task_struct *p;
> + struct task_struct *task;
>
> printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
> "name\n");
> - do_each_thread(g, p) {
> - struct mm_struct *mm;
>
> - if (mem && !task_in_mem_cgroup(p, mem))
> + for_each_process(p) {
> + /*
> + * We don't have is_global_init() check here, because the old
> + * code do that. printing init process is not big matter. But
> + * we don't hope to make unnecessary compatiblity breaking.
> + */
> + if (p->flags & PF_KTHREAD)
> continue;
> - if (!thread_group_leader(p))
> + if (mem && !task_in_mem_cgroup(p, mem))
> continue;
>
> - task_lock(p);
> - mm = p->mm;
> - if (!mm) {
> + task = find_lock_task_mm(p);
> + if (!task)
> /*
> - * total_vm and rss sizes do not exist for tasks with no
> - * mm so there's no need to report them; they can't be
> - * oom killed anyway.
> + * Probably oom vs task-exiting race was happen and ->mm
> + * have been detached. thus there's no need to report them;
> + * they can't be oom killed anyway.
> */
> - task_unlock(p);
> continue;
> - }
> +
> printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n",
> - p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> - get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
> + task->pid, __task_cred(task)->uid, task->tgid, task->mm->total_vm,
> + get_mm_rss(task->mm), (int)task_cpu(task), task->signal->oom_adj,
> p->comm);
> - task_unlock(p);
> - } while_each_thread(g, p);
> + task_unlock(task);
> + }
> }
>
> static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> --
> 1.6.5.2
>
>
>

2010-06-03 15:28:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

(off-topic)

out_of_memory() calls dump_header()->dump_tasks() lockless, we
need tasklist.

Oleg.

2010-06-03 20:12:31

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

On Thu, 3 Jun 2010, Oleg Nesterov wrote:

> (off-topic)
>
> out_of_memory() calls dump_header()->dump_tasks() lockless, we
> need tasklist.
>

Already fixed in my rewrite patchset, as most of these things are. Sigh.

2010-06-03 20:26:35

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On Thu, 3 Jun 2010, Oleg Nesterov wrote:

> On 06/02, David Rientjes wrote:
> >
> > On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:
> >
> > > Currently, PF_EXITING check is completely broken. because 1) It only
> > > care main-thread and ignore sub-threads
> >
> > Then check the subthreads.
> >

Did you want to respond to this?

> > > 2) If user enable core-dump
> > > feature, it can makes deadlock because the task during coredump ignore
> > > SIGKILL.
> > >
> >
> > It may ignore SIGKILL, but does not ignore fatal_signal_pending() being
> > true
>
> Wrong.
>
> Unless the oom victim is exactly the thread which dumps the core,
> fatal_signal_pending() won't be true for the dumper. Even if the
> victim and the dumper are from the same group, this thread group
> already has SIGNAL_GROUP_EXIT. And if they do not belong to the
> same group, SIGKILL has even less effect.
>

I'm guessing at the relevancy here because the changelog is extremely
poorly worded (if I were Andrew I would have no idea how important this
patch is based on the description other than the alarmist words of "... is
completely broken)", but if we're concerned about the coredumper not being
able to find adequate resources to allocate memory from, we can give it
access to reserves specifically, we don't need to go killing additional
tasks which may have their own coredumpers.

> Even if we chose the right thread we can race with
> clear_thread_flag(TIF_SIGPENDING), but fatal_signal_pending()
> checks signal_pending().
>
> > which gives it access to memory reserves with my patchset
>
> __get_user_pages() already checks fatal_signal_pending(), this
> is where the dumper allocates the memory (mostly).
>
> And I am not sure I understand the "access to memory reserves",
> the dumper should just stop if oom-kill decides it should die,
> it can use a lot more memory if it doesn't stop.
>

That's an alternative solution as well, but I'm disagreeing with the
approach here because this enforces absolutely no guarantee that the next
task to be oom killed will be the coredumper, its much more likely that
we're just going to kill yet another task for the coredump. That task may
have a coredumper too. Who knows.

> > Nacked-by: David Rientjes <[email protected]>
>
> Kosaki removes the code which only pretends to work, but it doesn't
> and leads to problems.
>

LOL, this code doesn't pretend to work, we rely heavily on it and have for
three years to prevent needless oom killing. We use the oom killer
constantly on every machine with memory isolation for enforcing a policy,
as the memcg will as well as it becomes more popular. Killing a job's
task is a serious matter and we'd prefer to kill as few as possible to
free memory. That's the entire motivation behind having a badness()
heuristic in the first place: so we don't kill tons of tasks to free
enough memory. Removing this check specifically will cause the oom killer
to kill tasks when it is currently a no-op and will free memory and we
have never had a problem with these so-called "deadlocks" that are being
found only by code inspection. So please care a little bit more about the
ramifications of other people's use of Linux before you go and insist that
certain code doesn't do a complete job in certain cases or it can
introduce a deadlock in situations that are anything but from the real
world must be removed entirely even though it has saved tons of our jobs
over the course of the past three years.

> If you think we need this check, imho it is better to make the patch
> which adds the "right" code with the nice changelog explaining how
> this code works.
>

I've got no objection to that, but until you find the right solution,
please don't remove what works for people in practice. Show me an oom
deadlock as the result of this and the use of sane applications. Nobody
reports them because these are such ridiculous cornercases that are being
addressed by sweeping and extreme code changes without other people's
input being considered.

Since Google uses the oom killer to enforce memory isolation on every one
of our machines, and we do it at a scale that is unparalleled to anybody
else, I think you should consider this input.

2010-06-03 22:02:37

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

On 06/03, David Rientjes wrote:
>
> On Thu, 3 Jun 2010, Oleg Nesterov wrote:
>
> > (off-topic)
> >
> > out_of_memory() calls dump_header()->dump_tasks() lockless, we
> > need tasklist.

forgot to mention, __out_of_memory() too.

> Already fixed in my rewrite patchset, as most of these things are. Sigh.

In 3/18, without any note in the changelog. Another minor thing
which can be fixed before rewrite.

And please note that it was me who pointed out we need tasklist
during the previous discussion. I'd suggest you to send a separate
patch on top of Kosaki's patches.

OK, this is boring ;) I am going to ignore everything except
technical issues in this thread.

Oleg.

2010-06-03 22:13:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On 06/03, David Rientjes wrote:
>
> On Thu, 3 Jun 2010, Oleg Nesterov wrote:
>
> > On 06/02, David Rientjes wrote:
> > >
> > > On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:
> > >
> > > > Currently, PF_EXITING check is completely broken. because 1) It only
> > > > care main-thread and ignore sub-threads
> > >
> > > Then check the subthreads.
> > >
>
> Did you want to respond to this?

Please explain what you mean. There were already a lot of discussions
about mt issues, I do not know what you have in mind.

> > > It may ignore SIGKILL, but does not ignore fatal_signal_pending() being
> > > true
> >
> > Wrong.
> >
> > Unless the oom victim is exactly the thread which dumps the core,
> > fatal_signal_pending() won't be true for the dumper. Even if the
> > victim and the dumper are from the same group, this thread group
> > already has SIGNAL_GROUP_EXIT. And if they do not belong to the
> > same group, SIGKILL has even less effect.
> >
>
> I'm guessing at the relevancy here because the changelog is extremely
> poorly worded (if I were Andrew I would have no idea how important this
> patch is based on the description other than the alarmist words of "... is
> completely broken)", but if we're concerned about the coredumper not being
> able to find adequate resources to allocate memory from, we can give it
> access to reserves specifically,

I don't think so. If oom-kill wants to kill the task which dumps the
code, it should stop the coredumping and exit.

> we don't need to go killing additional
> tasks which may have their own coredumpers.

Sorry, can't understand.

> That's an alternative solution as well, but I'm disagreeing with the
> approach here because this enforces absolutely no guarantee that the next
> task to be oom killed will be the coredumper, its much more likely that
> we're just going to kill yet another task for the coredump. That task may
> have a coredumper too. Who knows.

Again, please explain this to me.

> > > Nacked-by: David Rientjes <[email protected]>
> >
> > Kosaki removes the code which only pretends to work, but it doesn't
> > and leads to problems.
> >
>
> LOL, this code doesn't pretend to work,
> ...
> certain code doesn't do a complete job in certain cases or it can
> introduce a deadlock in situations

OK, agreed. It is not that it never works.

Oleg.

2010-06-03 23:18:22

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 08/12] oom: dump_tasks() use find_lock_task_mm() too

On Fri, 4 Jun 2010, Oleg Nesterov wrote:

> On 06/03, David Rientjes wrote:
> >
> > On Thu, 3 Jun 2010, Oleg Nesterov wrote:
> >
> > > (off-topic)
> > >
> > > out_of_memory() calls dump_header()->dump_tasks() lockless, we
> > > need tasklist.
>
> forgot to mention, __out_of_memory() too.
>
> > Already fixed in my rewrite patchset, as most of these things are. Sigh.
>
> In 3/18, without any note in the changelog. Another minor thing
> which can be fixed before rewrite.
>

It's _not_ rc material, we don't merge patches into rc kernels where the
end result is panic() in all cases anyway.

2010-06-03 23:24:07

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On Fri, 4 Jun 2010, Oleg Nesterov wrote:

> > > > > Currently, PF_EXITING check is completely broken. because 1) It only
> > > > > care main-thread and ignore sub-threads
> > > >
> > > > Then check the subthreads.
> > > >
> >
> > Did you want to respond to this?
>
> Please explain what you mean. There were already a lot of discussions
> about mt issues, I do not know what you have in mind.
>

Can you check the subthreads to see if they are not PF_EXITING?

> > I'm guessing at the relevancy here because the changelog is extremely
> > poorly worded (if I were Andrew I would have no idea how important this
> > patch is based on the description other than the alarmist words of "... is
> > completely broken)", but if we're concerned about the coredumper not being
> > able to find adequate resources to allocate memory from, we can give it
> > access to reserves specifically,
>
> I don't think so. If oom-kill wants to kill the task which dumps the
> code, it should stop the coredumping and exit.
>

That's a coredump change, not an oom killer change. If the coredumper
needs memory and runs into the oom killer, this PF_EXITING check, which
you want to remove, gives it access to memory reserves by setting
TIF_MEMDIE so it can quickly finish and die. This allows it to exit
without oom killing anything else because the tasklist scan in the oom
killer is not preempted by finding a TIF_MEMDIE task.

2010-06-04 10:06:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

On 06/03, David Rientjes wrote:
>
> On Fri, 4 Jun 2010, Oleg Nesterov wrote:
>
> > > > > > Currently, PF_EXITING check is completely broken. because 1) It only
> > > > > > care main-thread and ignore sub-threads
> > > > >
> > > > > Then check the subthreads.
> > > > >
> > >
> > > Did you want to respond to this?
> >
> > Please explain what you mean. There were already a lot of discussions
> > about mt issues, I do not know what you have in mind.
>
> Can you check the subthreads to see if they are not PF_EXITING?

To detect the process with the dead group leader?

Yes, we can. We already discussed this. Probably it is better to check
PF_EXITING and signal_group_exit().

> > > I'm guessing at the relevancy here because the changelog is extremely
> > > poorly worded (if I were Andrew I would have no idea how important this
> > > patch is based on the description other than the alarmist words of "... is
> > > completely broken)", but if we're concerned about the coredumper not being
> > > able to find adequate resources to allocate memory from, we can give it
> > > access to reserves specifically,
> >
> > I don't think so. If oom-kill wants to kill the task which dumps the
> > code, it should stop the coredumping and exit.
>
> That's a coredump change, not an oom killer change.

Yes. do_coredump() should be fixed. This is not trivial (and needs the
subtle changes outside of fs/exec.c), we are looking for the simple fix
for now.

> If the coredumper
> needs memory and runs into the oom killer, this PF_EXITING check, which
> you want to remove, gives it access to memory reserves by setting
> TIF_MEMDIE so it can quickly finish and die. This allows it to exit
> without oom killing anything else because the tasklist scan in the oom
> killer is not preempted by finding a TIF_MEMDIE task.

David, sorry. I already tried to explain (at least twice) that TIF_MEMDIE
(or SIGKILL even if do_coredump() was interruptible) can not help unless
you find the right thread, this is not trivial even if we forget about
CLONE_VM tasks.

And personally I disagree that it should use memory reserves, but this
doesn't matter.


Let's stop this. You shouldn't convince me. I am not the author of this
patch, and I said many times that I do not pretend I understand oom-kill
needs. I jumped into this discussion because your initial objection
(fatal_signal_pending() should fix the problems) was technically wrong.

Oleg.

2010-06-04 10:54:49

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 09/12] oom: remove PF_EXITING check completely

> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > Acked-by: Oleg Nesterov <[email protected]>
>
> Nacked-by: David Rientjes <[email protected]>
>
> You have no real world experience in using the oom killer for memory
> containment and don't understand how critical it is to protect other
> vital system tasks that are needlessly killed as the result of this patch.

Heh, You really think "you have no experience" takes meaningful argument?
If I'd say "You have no real world experience in using linux", you are
satisfied?

Anyway, I don't care such "Please don't fix the bug" claim. the bug is
bug. not something else.


2010-06-04 10:54:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 13/12] oom: dump_header() need tasklist_lock

> (off-topic)
>
> out_of_memory() calls dump_header()->dump_tasks() lockless, we
> need tasklist.

Makes perfectly sense. Thank you!


================================================================
Commit 1b604d75(oom: dump stack and VM state when oom killer panics)
added dump_header() call to three places. But It forgot to add
the tasklist_lock. Now, dump_header() is called without the lock.
It is obviously inadequate.

This patch fixes it.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ad85e1b..2678a04 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -600,8 +600,8 @@ retry:

/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
- read_unlock(&tasklist_lock);
dump_header(NULL, gfp_mask, order, NULL);
+ read_unlock(&tasklist_lock);
panic("Out of memory and no killable processes...\n");
}

@@ -633,7 +633,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
return;

if (sysctl_panic_on_oom == 2) {
+ read_lock(&tasklist_lock);
dump_header(NULL, gfp_mask, order, NULL);
+ read_unlock(&tasklist_lock);
panic("out of memory. Compulsory panic_on_oom is selected.\n");
}

@@ -664,6 +666,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
case CONSTRAINT_NONE:
if (sysctl_panic_on_oom) {
dump_header(NULL, gfp_mask, order, NULL);
+ read_unlock(&tasklist_lock);
panic("out of memory. panic_on_oom is selected\n");
}
/* Fall-through */
--
1.6.5.2




2010-06-08 11:41:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [mmotm 0521][PATCH 0/12] various OOM fixes for 2.6.35

> Hi
>
> This patch series is collection of various OOM bugfixes. I think
> all of patches can send to 2.6.35.
> Recently, David Rientjes and Luis Claudio R. Goncalves posted other
> various imporovement. I'll collect such 2.6.36 items and I plan to
> push -mm at next week.

Today, Linus shipped -rc2 and this series automatically slipped for 2.6.36.


2010-06-08 11:41:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [mmotm 0521][PATCH 0/12] various OOM fixes for 2.6.35

> Hi
>
> This patch series is collection of various OOM bugfixes. I think
> all of patches can send to 2.6.35.
> Recently, David Rientjes and Luis Claudio R. Goncalves posted other
> various imporovement. I'll collect such 2.6.36 items and I plan to
> push -mm at next week.

Linus shipped -rc2 and this series automatically slipped for 2.6.36.


2010-06-08 11:42:12

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 12/12] oom: give current access to memory reserves if it has been killed

> From: David Rientjes <[email protected]>
>
> It's possible to livelock the page allocator if a thread has
> mm->mmap_sem and fails to make forward progress because the
> oom killer selects another thread sharing the same ->mm to
> kill that cannot exit until the semaphore is dropped.
>
> The oom killer will not kill multiple tasks at the same time; each oom
> killed task must exit before another task may be killed. Thus, if one
> thread is holding mm->mmap_sem and cannot allocate memory, all threads
> sharing the same ->mm are blocked from exiting as well. In the oom kill
> case, that means the thread holding mm->mmap_sem will never free
> additional memory since it cannot get access to memory reserves and the
> thread that depends on it with access to memory reserves cannot exit
> because it cannot acquire the semaphore. Thus, the page allocators
> livelocks.
>
> When the oom killer is called and current happens to have a pending
> SIGKILL, this patch automatically gives it access to memory reserves and
> returns. Upon returning to the page allocator, its allocation will
> hopefully succeed so it can quickly exit and free its memory. If not,
> the page allocator will fail the allocation if it is not __GFP_NOFAIL.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/oom_kill.c | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 67b5fa5..ad85e1b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -638,6 +638,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> }
>
> /*
> + * If current has a pending SIGKILL, then automatically select it. The
> + * goal is to allow it to allocate so that it may quickly exit and free
> + * its memory.
> + */
> + if (fatal_signal_pending(current)) {
> + set_tsk_thread_flag(current, TIF_MEMDIE);
> + return;
> + }

Self NAK this.
We have no gurantee that current is oom killable. Oh, here is
out_of_memory(), sigh.


2010-06-08 18:27:01

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 12/12] oom: give current access to memory reserves if it has been killed

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 67b5fa5..ad85e1b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -638,6 +638,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > }
> >
> > /*
> > + * If current has a pending SIGKILL, then automatically select it. The
> > + * goal is to allow it to allocate so that it may quickly exit and free
> > + * its memory.
> > + */
> > + if (fatal_signal_pending(current)) {
> > + set_tsk_thread_flag(current, TIF_MEMDIE);
> > + return;
> > + }
>
> Self NAK this.
> We have no gurantee that current is oom killable. Oh, here is
> out_of_memory(), sigh.
>

We're not killing it, it's already dying. We're simply giving it access
to memory reserves so it may allocate and quickly exit to free its memory.
Being OOM_DISABLE does not imply the task cannot exit or use memory
reserves in the exit path.