LinuxLists.cc - Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account

Subject: [PATCH] oom: remove PF_EXITING check completely

> On 06/01, KOSAKI Motohiro wrote:
> >
> > > I'd like to add a note... with or without this, we have problems
> > > with the coredump. A thread participating in the coredumping
> > > (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> > > mean it is going to exit soon, and the dumper can use a lot more memory.
> >
> > Sure. I think coredump should do nothing if oom occurs.
> > So, is merely adding PF_COREDUMP a bad idea? I mean
> >
> > task-flags               allocator
> > ------------------------------------------------------------
> > none                     N/A
> > TIF_MEMDIE               allow to use emergency memory.
> >                          don't call page reclaim.
> > PF_COREDUMP              N/A
> > TIF_MEMDIE+PF_COREDUMP   disallow to use emergency memory.
> >                          don't call page reclaim.
> >
> > In other words, the coredump path makes allocations fail if the task
> > is marked as TIF_MEMDIE.
>
> Perhaps... But where should TIF_MEMDIE go in this case? Let me clarify.
>
> Two threads, group-leader L and its sub-thread T. T dumps the core.
> In this case both threads have ->mm != NULL, L has PF_EXITING.
>
> The first problem is, select_bad_process() always returns -1 in this
> case (even if the caller is T, this doesn't matter).
>
> The second problem is that we should add TIF_MEMDIE to T, not L.
>
> This is more or less easy. For simplicity, let's suppose we removed
> this PF_EXITING check from select_bad_process().

Today I thought about making some band-aid patches for this issue, but
yes, I've reached the same conclusion.

If we consider the multithreaded and core dump situations, all fixes are just
band-aids. We can't remove the chance of deadlock completely.

Deadlock is certainly the worst result, so the minor PF_EXITING optimization
isn't worth that much.

==============================================================
Subject: [PATCH] oom: remove PF_EXITING check completely

PF_EXITING is the wrong check if the task has multiple threads. This patch
removes it.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
---
 mm/oom_kill.c |   27 ---------------------------
 1 files changed, 0 insertions(+), 27 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e7f0f9..b06f8d1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);

-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if ((p->flags & PF_EXITING) && p->mm) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
-
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
@@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);

-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		__oom_kill_process(p, mem, 0);
-		return 0;
-	}
-
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
 					message, task_pid_nr(p), p->comm, points);

--
1.6.5.2
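
The two problems described in the quoted text (the scan aborting whenever
another thread has PF_EXITING with a live ->mm, and the exiting leader being
picked instead of the dumper) can be reproduced in a small userspace model.
This is only a sketch under stated assumptions: struct model_task,
model_select_bad_process() and its numeric return codes are hypothetical
stand-ins, the points field replaces badness(), and no locking or real
task-list walking is modelled.

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative flag bit for this model only. */
#define PF_EXITING 0x00000004

/* Hypothetical, heavily simplified stand-in for struct task_struct. */
struct model_task {
	unsigned int flags;	/* PF_* bits */
	const void *mm;		/* non-NULL while the task still has an mm */
	unsigned long points;	/* stand-in for the result of badness() */
};

/*
 * Model of the loop containing the check that the patch removes.
 * Returns the index of the chosen task, -1 if the scan aborts because
 * some other task is exiting, or -2 if nothing was chosen.
 */
int model_select_bad_process(const struct model_task *tasks, size_t n,
			     size_t current_idx)
{
	int chosen = -2;
	unsigned long best = 0;

	for (size_t i = 0; i < n; i++) {
		const struct model_task *p = &tasks[i];

		if ((p->flags & PF_EXITING) && p->mm) {
			if (i != current_idx)
				return -1;	/* wait for p instead of killing */
			chosen = (int)i;	/* current is exiting: pick it */
			best = ULONG_MAX;
		}
		if (p->points > best || chosen < 0) {
			chosen = (int)i;
			best = p->points;
		}
	}
	return chosen;
}
```

With the leader L marked PF_EXITING and the dumper T as the caller, the model
aborts with -1. With L itself as the caller it picks L, never T.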

2010-06-02 13:54:47

Subject: [PATCH] oom: Make coredump interruptible

> Otoh, if we make do_coredump() interruptible (and we should do this
> in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> needed? Afaics we always send SIGKILL along with TIF_MEMDIE.

How about a per-process oom flag + interruptible coredump?

This per-process oom flag can also be used to let vmscan exit early.
(IOW, it can help with DavidR's mmap_sem issue.)

===========================================================
Subject: [PATCH] oom: Make coredump interruptible

If the OOM victim process is in the middle of a core dump, sending SIGKILL
is a no-op. Unfortunately, a core dump needs a relatively large amount of
memory. This means OOM vs coredump can deadlock.

So the coredump logic should check whether the task has received SIGKILL
from the OOM killer.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
 fs/binfmt_elf.c       |    5 +++++
 include/linux/sched.h |    3 +++
 mm/oom_kill.c         |    1 +
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..aa47979 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* Now, The process received OOM. Exit soon! */
+			if (current->signal->oom_victim)
+				stop = 1;
+
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..1c4fa86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,9 @@ struct signal_struct {
 	int notify_count;
 	struct task_struct *group_exit_task;

+	/* true mean the process is OOM-killer victim. */
+	bool oom_victim;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int group_stop_count;
 	unsigned int flags; /* see SIGNAL_* flags below */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f33a1b8..39e31bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -400,6 +400,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	p->signal->oom_victim = true;

 	force_sig(SIGKILL, p);

--
1.6.5.2

2010-06-02 15:43:40

Subject: Re: [PATCH] oom: Make coredump interruptible

(add Roland)

On 06/02, KOSAKI Motohiro wrote:
>
> > Otoh, if we make do_coredump() interruptible (and we should do this
> > in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> > needed? Afaics we always send SIGKILL along with TIF_MEMDIE.
>
> How about a per-process oom flag + interruptible coredump?
>
> This per-process oom flag can also be used to let vmscan exit early.
> (IOW, it can help with DavidR's mmap_sem issue.)

Firstly, this solution is not complete. We should make it really
interruptible (from user-space too), but we need more changes for
this (in particular we need to distinguish group-exit/exec cases
from the explicit SIGKILL case). Let's not discuss this here, this
is a different story.

But. I agree very much that it makes sense to add the quick fix
right now. Even if this fix will be superseded by the "proper"
fixes later.

> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
> page_cache_release(page);
> } else
> stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> + /* Now, The process received OOM. Exit soon! */
> + if (current->signal->oom_victim)
> + stop = 1;

Agreed, most problems with memory allocations should come from this loop.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -544,6 +544,9 @@ struct signal_struct {
> int notify_count;
> struct task_struct *group_exit_task;
>
> + /* true mean the process is OOM-killer victim. */
> + bool oom_victim;

Well, the new word in signal_struct is not nice. It is better to
set SIGNAL_OOM_XXX in ->signal->flags (this needs ->siglock).

But. I don't think that signal_struct is the right place for the marker.

The thread which actually dumps the core doesn't necessarily belong
to the same thread group, but it can share ->mm with the selected
oom victim.

IOW, we should mark ->mm instead (perhaps mm->flags) or mm->core_state.
This in turn means we need find_lock_task_mm().

What do you think?

Oleg.
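
The placement argument above can be illustrated with a toy userspace model.
This is a sketch, not kernel code: the model_* types and helpers are
hypothetical, and the only property modelled is that ->signal is shared
within one thread group while ->mm may be shared across thread groups.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; only the sharing relationships matter here. */
struct model_mm {
	bool oom_killed;	/* stand-in for a bit in mm->flags */
};

struct model_signal {
	bool oom_victim;	/* stand-in for the proposed signal_struct field */
};

struct model_task {
	struct model_signal *signal;	/* shared inside one thread group */
	struct model_mm *mm;		/* may be shared across thread groups */
};

/* Mark the victim in both places so the two placements can be compared. */
void model_mark_victim(struct model_task *victim)
{
	victim->signal->oom_victim = true;
	victim->mm->oom_killed = true;
}

/* What a dumping thread would observe with the marker in signal_struct. */
bool dumper_sees_signal_flag(const struct model_task *dumper)
{
	return dumper->signal->oom_victim;
}

/* What a dumping thread would observe with the marker in the mm. */
bool dumper_sees_mm_flag(const struct model_task *dumper)
{
	return dumper->mm->oom_killed;
}
```

If the dumper lives in a different thread group but shares the victim's mm,
only the mm-level marker is visible to it in this model.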

2010-06-02 15:56:58

Subject: Re: [PATCH] oom: remove PF_EXITING check completely

On 06/02, KOSAKI Motohiro wrote:
>
> Today I thought about making some band-aid patches for this issue, but
> yes, I've reached the same conclusion.
>
> If we consider the multithreaded and core dump situations, all fixes are just
> band-aids. We can't remove the chance of deadlock completely.
>
> Deadlock is certainly the worst result, so the minor PF_EXITING optimization
> isn't worth that much.

Agreed! I was always wondering if it really helps in practice.

> Subject: [PATCH] oom: remove PF_EXITING check completely
>
> PF_EXITING is the wrong check if the task has multiple threads. This patch
> removes it.
>
> Suggested-by: Oleg Nesterov <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Nick Piggin <[email protected]>
> ---
> mm/oom_kill.c | 27 ---------------------------
> 1 files changed, 0 insertions(+), 27 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e7f0f9..b06f8d1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> if (test_tsk_thread_flag(p, TIF_MEMDIE))
> return ERR_PTR(-1UL);
>
> - /*
> - * This is in the process of releasing memory so wait for it
> - * to finish before killing some other task by mistake.
> - *
> - * However, if p is the current task, we allow the 'kill' to
> - * go ahead if it is exiting: this will simply set TIF_MEMDIE,
> - * which will allow it to gain access to memory reserves in
> - * the process of exiting and releasing its resources.
> - * Otherwise we could get an easy OOM deadlock.
> - */
> - if ((p->flags & PF_EXITING) && p->mm) {
> - if (p != current)
> - return ERR_PTR(-1UL);
> -
> - chosen = p;
> - *ppoints = ULONG_MAX;
> - }
> -
> points = badness(p, uptime.tv_sec);
> if (points > *ppoints || !chosen) {
> chosen = p;
> @@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem);
>
> - /*
> - * If the task is already exiting, don't alarm the sysadmin or kill
> - * its children or threads, just set TIF_MEMDIE so it can die quickly
> - */
> - if (p->flags & PF_EXITING) {
> - __oom_kill_process(p, mem, 0);
> - return 0;
> - }
> -
> printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> message, task_pid_nr(p), p->comm, points);
>
> --

2010-06-02 17:30:34

Subject: Re: [PATCH] oom: Make coredump interruptible

Why not just test TIF_MEMDIE?

Thanks,
Roland

2010-06-02 17:54:56

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/02, Roland McGrath wrote:
>
> Why not just test TIF_MEMDIE?

Because it is per-thread.

when select_bad_process() finds the task P to kill it can participate
in the core dump (sleep in exit_mm), but we should somehow inform the
thread which actually dumps the core: P->mm->core_state->dumper.

Well, we can use TIF_MEMDIE if we chose the right thread, I think.
But perhaps mm->flags |= MMF_OOM is better, it can have other user.
I dunno.

Oleg.

2010-06-02 18:59:10

Subject: Re: [PATCH] oom: Make coredump interruptible

> Because it is per-thread.

I see.

> when select_bad_process() finds the task P to kill it can participate
> in the core dump (sleep in exit_mm), but we should somehow inform the
> thread which actually dumps the core: P->mm->core_state->dumper.

Perhaps it should simply do that: if you would choose P to oom-kill, and
P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

> Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> I dunno.

This is all the quick hack before we get around to just making core dumping
fully-interruptible, no? So we should go with whatever is the simplest
change now.

Perhaps this belongs in another thread as you suggested. But I wonder what
we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.

Thanks,
Roland
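
The redirection suggested above ("choose the dumper instead") could look
roughly like the following userspace sketch. All model_* names are
hypothetical, the dumper is stored as a plain task pointer for simplicity,
and locking of mm->core_state is deliberately ignored, so this shows only
the selection logic.

```c
#include <stddef.h>

struct model_task;

/* Hypothetical stand-in for the core dump state hanging off the mm. */
struct model_core_state {
	struct model_task *dumper;	/* thread that writes the core file */
};

struct model_mm {
	struct model_core_state *core_state;	/* NULL when no dump is running */
};

struct model_task {
	struct model_mm *mm;
};

/*
 * If a core dump is in progress on p's mm, redirect the kill to the
 * dumping thread; otherwise keep the original choice.
 */
struct model_task *model_pick_oom_target(struct model_task *p)
{
	if (p->mm && p->mm->core_state && p->mm->core_state->dumper)
		return p->mm->core_state->dumper;
	return p;
}
```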

2010-06-02 20:40:29

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/02, Roland McGrath wrote:
>
> > when select_bad_process() finds the task P to kill it can participate
> > in the core dump (sleep in exit_mm), but we should somehow inform the
> > thread which actually dumps the core: P->mm->core_state->dumper.
>
> Perhaps it should simply do that: if you would choose P to oom-kill, and
> P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

... to set TIF_MEMDIE which should be checked in elf_core_dump().

Probably yes.

> > Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> > But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> > I dunno.
>
> This is all the quick hack before we get around to just making core dumping
> fully-interruptible, no? So we should go with whatever is the simplest
> change now.

Yes.

> Perhaps this belongs in another thread as you suggested. But I wonder what
> we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.

Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
must not exit until core_state->dumper->thread returns from do_coredump().
If nothing else, the dumper can use its task_struct and it relies on
the stable core_thread->next list. And right now TASK_KILLABLE can't
work anyway, it is possible that fatal_signal_pending() is true.

But perhaps we can do something later. Assuming that do_coredump() is
interruptible, TASK_KILLABLE can make the difference only if the dumper
belongs to another thread-group.

Oleg.

2010-06-02 21:02:35

by David Rientjes

Subject: Re: [PATCH] oom: remove PF_EXITING check completely

On Wed, 2 Jun 2010, Oleg Nesterov wrote:

> > Today I thought about making some band-aid patches for this issue, but
> > yes, I've reached the same conclusion.
> >
> > If we consider the multithreaded and core dump situations, all fixes are just
> > band-aids. We can't remove the chance of deadlock completely.
> >
> > Deadlock is certainly the worst result, so the minor PF_EXITING optimization
> > isn't worth that much.
>
> Agreed! I was always wondering if it really helps in practice.
>

Nack, this certainly does help in practice, it prevents needlessly killing
additional tasks when one is exiting and may free memory. It's much
better to defer killing something temporarily if an eligible task (i.e.
one that has a high probability of memory allocations on current's nodes
or contributing to its memcg) is exiting.

We depend on this check specifically for our use of cpusets, so please
don't remove it.

2010-06-03 04:48:15

Subject: Re: [PATCH] oom: remove PF_EXITING check completely

> On Wed, 2 Jun 2010, Oleg Nesterov wrote:
>
> > > Today I thought about making some band-aid patches for this issue, but
> > > yes, I've reached the same conclusion.
> > >
> > > If we consider the multithreaded and core dump situations, all fixes are just
> > > band-aids. We can't remove the chance of deadlock completely.
> > >
> > > Deadlock is certainly the worst result, so the minor PF_EXITING optimization
> > > isn't worth that much.
> >
> > Agreed! I was always wondering if it really helps in practice.
> >
>
> Nack, this certainly does help in practice, it prevents needlessly killing
> additional tasks when one is exiting and may free memory. It's much
> better to defer killing something temporarily if an eligible task (i.e.
> one that has a high probability of memory allocations on current's nodes
> or contributing to its memcg) is exiting.
>
> We depend on this check specifically for our use of cpusets, so please
> don't remove it.

Your claim violates our development process. Oleg pointed out that this check
not only doesn't work well, but can also cause a deadlock. So we certainly
need some fix. I'll therefore remove this check completely in the 2.6.35
timeframe.

But this doesn't mean we refuse to let you make a better patch at all. I expect
we can merge it very soon if you make such a patch.

2010-06-03 06:29:09

by David Rientjes

Subject: Re: [PATCH] oom: remove PF_EXITING check completely

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> > On Wed, 2 Jun 2010, Oleg Nesterov wrote:
> >
> > > > Today I thought about making some band-aid patches for this issue, but
> > > > yes, I've reached the same conclusion.
> > > >
> > > > If we consider the multithreaded and core dump situations, all fixes are just
> > > > band-aids. We can't remove the chance of deadlock completely.
> > > >
> > > > Deadlock is certainly the worst result, so the minor PF_EXITING optimization
> > > > isn't worth that much.
> > >
> > > Agreed! I was always wondering if it really helps in practice.
> > >
> >
> > Nack, this certainly does help in practice, it prevents needlessly killing
> > additional tasks when one is exiting and may free memory. It's much
> > better to defer killing something temporarily if an eligible task (i.e.
> > one that has a high probability of memory allocations on current's nodes
> > or contributing to its memcg) is exiting.
> >
> > We depend on this check specifically for our use of cpusets, so please
> > don't remove it.
>
> Your claim violates our development process. Oleg pointed out that this check
> not only doesn't work well, but can also cause a deadlock. So we certainly
> need some fix. I'll therefore remove this check completely in the 2.6.35
> timeframe.
>

Show me your deadlock. I want to see it. In practice.

We've been using this check specifically for three years and it prevents
needlessly killing additional tasks when one is already exiting and will
free its memory. That's a crucial aspect of using cpusets that run out of
memory constantly.

Unless you actually have real world experience with using the oom killer
to affect a memory containment strategy, I don't buy into your overly
exaggerated claims that these are all bugfixes and these races that you
have no practical evidence to support actually even matter but speculate
based on pure code inspection are important.

2010-06-03 14:06:00

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/02, Oleg Nesterov wrote:
>
> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Well, nothing can protect mm->core_state, the dumper owns it. Of course
we can add the locking, but this is not nice.

And again, perhaps MMF_OOMKILLED can be useful anyway.

So, I think this would be the most quick/simple fix for now.

Oleg.

2010-06-04 10:55:00

Subject: Re: [PATCH] oom: Make coredump interruptible

> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Yep, probably. But can you please allow me an additional explanation?

In the multi-threaded OOM case, we have two problematic routines, coredump
and vmscan. Roland's idea can only solve the former.

But I'm also interested in vmscan exiting quickly once OOM is received. If other
threads get stuck in vmscan trying to free additional pages (this is very typical
because usually every thread calls some syscall and eventually calls kmalloc etc.),
recovering from oom becomes very slow even if this doesn't cause a deadlock.

Unfortunately, vmscan needs much refactoring before applying this idea,
so I didn't include such fixes.

I mean I hope to implement a per-process OOM flag even if coredump doesn't
really need it.

So, I created an MMF_OOM patch today. It is still just for discussion.

From f099e1ba6e7b5654b35b468c13e1ae9e4f182ea4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Fri, 4 Jun 2010 18:56:56 +0900
Subject: [RFC][PATCH v2] oom: make coredump interruptible

If the OOM victim process is in the middle of a core dump, sending SIGKILL
is a no-op. Unfortunately, a core dump needs a relatively large amount of
memory. This means OOM vs coredump can deadlock.

So the coredump logic should check whether the task has received SIGKILL
from the OOM killer.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
 fs/binfmt_elf.c       |    4 ++++
 include/linux/sched.h |    1 +
 mm/oom_kill.c         |    3 ++-
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..2aca748 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* The task need to exit ASAP if received OOM. */
+			if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
+				stop = 1;
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..53b7caa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -436,6 +436,7 @@ extern int get_dumpable(struct mm_struct *mm);
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_OOM_KILLED		17	/* Killed by OOM */

 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2678a04..29850c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,7 +401,6 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 		       K(p->mm->total_vm),
 		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);

 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	set_bit(MMF_OOM_KILLED, &p->mm->flags);
+	task_unlock(p);

 	force_sig(SIGKILL, p);

--
1.6.5.2

2010-06-04 11:28:57

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/04, KOSAKI Motohiro wrote:
>
> > ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> >
> > Probably yes.
>
> Yep, probably. But can you please allow me an additional explanation?
>
> In the multi-threaded OOM case, we have two problematic routines, coredump
> and vmscan. Roland's idea can only solve the former.
>
> But I'm also interested in vmscan exiting quickly once OOM is received.

Yes, agreed. See another email from me, MMF_ flags looks "obviously
useful" to me.

(I'd suggest you to add a note into the changelog, to explain
that the new flag makes sense even without coredump problems).

> @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
> + set_bit(MMF_OOM_KILLED, &p->mm->flags);
> + task_unlock(p);

IIUC, it has find_lock_task_mm() above and thus we can trust p->mm?
(I am asking just in case, I lost the plot a bit).

Ack or Reviewed, whatever you prefer.

Very minor nit.

> @@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
> page_cache_release(page);
> } else
> stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> + /* The task need to exit ASAP if received OOM. */
> + if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
> + stop = 1;

Perhaps this check makes more sense at the start of the loop,
and there is no need to set "stop = 1" (this var is not visible
outside of "for (;;) {}" anyway). Cosmetic, up to you.

Oleg.

2010-06-04 11:36:18

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/04, Oleg Nesterov wrote:
>
> (I'd suggest you to add a note into the changelog, to explain
> that the new flag makes sense even without coredump problems).

And. May I ask you to add another note into the changelog?

> > @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
> > */
> > p->rt.time_slice = HZ;
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > + set_bit(MMF_OOM_KILLED, &p->mm->flags);

I think the changelog should explain that, if we race with fork(),
this flag can't leak into the child's new mm. mm_init() filters
the bits outside of MMF_INIT_MASK.

If we race with exec, it can't leak because mm_alloc() does
memset(0).

Oleg.
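
The fork() half of this argument can be checked with plain bit arithmetic.
The sketch below is userspace C, not kernel code: the macro values are
reproduced from memory of include/linux/sched.h of that era and should be
treated as assumptions, and model_child_mm_flags() is a hypothetical
stand-in for the masking that mm_init() applies to the parent's flags.

```c
/* Assumed values, mirroring the dump-related bits in mm->flags. */
#define MMF_DUMPABLE_BITS	2
#define MMF_DUMPABLE_MASK	((1UL << MMF_DUMPABLE_BITS) - 1)
#define MMF_DUMP_FILTER_SHIFT	MMF_DUMPABLE_BITS
#define MMF_DUMP_FILTER_BITS	7
#define MMF_DUMP_FILTER_MASK \
	(((1UL << MMF_DUMP_FILTER_BITS) - 1) << MMF_DUMP_FILTER_SHIFT)
#define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

/* Bit number proposed by the patch under discussion. */
#define MMF_OOM_KILLED		17

/*
 * Hypothetical model of how a new mm inherits flags on fork():
 * only the bits inside MMF_INIT_MASK survive.
 */
unsigned long model_child_mm_flags(unsigned long parent_flags)
{
	return parent_flags & MMF_INIT_MASK;
}
```

Under these assumed values the mask covers only bits 0 to 8, so bit 17 set in
the parent's flags is dropped in the child.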

2010-06-09 19:55:19

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/04, Oleg Nesterov wrote:
>
> On 06/04, KOSAKI Motohiro wrote:
> >
> > In the multi-threaded OOM case, we have two problematic routines, coredump
> > and vmscan. Roland's idea can only solve the former.
> >
> > But I'm also interested in vmscan exiting quickly once OOM is received.
>
> Yes, agreed. See another email from me, MMF_ flags looks "obviously
> useful" to me.

Well. But somehow we forgot about the !coredumping case... Suppose
that select_bad_process() chooses the process P to kill and we have
other processes (not sub-threads) which share the same ->mm.

In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
that we kill P and after that the "out-of-memory" condition goes away.
But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
clear this flag?

Perhaps something like below makes sense for now.

Oleg.

--- x/fs/exec.c
+++ x/fs/exec.c
@@ -1594,6 +1594,7 @@ static inline int zap_threads(struct tas
 	spin_lock_irq(&tsk->sighand->siglock);
 	if (!signal_group_exit(tsk->signal)) {
 		mm->core_state = core_state;
+		set_bit(MMF_COREDUMP, &mm->flags);
 		nr = zap_process(tsk, exit_code);
 	}
 	spin_unlock_irq(&tsk->sighand->siglock);
--- x/fs/binfmt_elf.c
+++ x/fs/binfmt_elf.c
@@ -2028,6 +2028,9 @@ static int elf_core_dump(struct coredump
 			struct page *page;
 			int stop;

+			if (!test_bit(MMF_COREDUMP, &current->mm->flags))
+				goto end_coredump;
+
 			page = get_dump_page(addr);
 			if (page) {
 				void *kaddr = kmap(page);
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);

+	clear_bit(MMF_COREDUMP, &p->mm->flags);
 	force_sig(SIGKILL, p);
 }
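
The inverted-flag idea in this template (set the bit when the dump starts,
clear it from the OOM killer, test it before every page) can be exercised in
a tiny userspace model. This is a sketch under stated assumptions: the
model_* names are hypothetical, a plain bool stands in for the MMF_COREDUMP
bit, and the OOM kill is simulated at a fixed page index instead of arriving
asynchronously.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mm; the bool models the MMF_COREDUMP bit. */
struct model_mm {
	bool coredump;
};

/* Dump start: mark the mm as being dumped. */
void model_zap_threads(struct model_mm *mm)
{
	mm->coredump = true;
}

/* OOM kill: clearing the bit is what tells the dumper to stop. */
void model_oom_kill(struct model_mm *mm)
{
	mm->coredump = false;
}

/*
 * Write up to nr_pages pages. The simulated OOM kill fires just before
 * page index kill_at is processed (use kill_at >= nr_pages for "never").
 * Returns the number of pages written.
 */
size_t model_core_dump(struct model_mm *mm, size_t nr_pages, size_t kill_at)
{
	size_t written = 0;

	model_zap_threads(mm);
	for (size_t i = 0; i < nr_pages; i++) {
		if (i == kill_at)
			model_oom_kill(mm);
		if (!mm->coredump)
			break;		/* same test as at the loop start */
		written++;
	}
	return written;
}
```

Because the bit is only set while a dump is running, a stale marker cannot
linger on an mm that is still in use after the OOM condition goes away, which
is the concern raised above.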

2010-06-09 20:42:03

by David Rientjes

Subject: Re: [PATCH] oom: Make coredump interruptible

On Wed, 9 Jun 2010, Oleg Nesterov wrote:

> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
>
> + clear_bit(MMF_COREDUMP, &p->mm->flags);
> force_sig(SIGKILL, p);
> }
>

This requires task_lock(p).

2010-06-09 21:06:14

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/09, David Rientjes wrote:
>
> On Wed, 9 Jun 2010, Oleg Nesterov wrote:
>
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
> > p->rt.time_slice = HZ;
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > + clear_bit(MMF_COREDUMP, &p->mm->flags);
> > force_sig(SIGKILL, p);
> > }
> >
>
> This requires task_lock(p).

Yes, yes, sure. This is only template. I'll wait for the next mmotm
to send the actual patch on top of recent changes. Unless Kosaki/Roland
have other ideas.

Imho, we really need to fix the coredump/oom problem.

Oleg.

2010-06-13 11:25:43

Subject: Re: [PATCH] oom: Make coredump interruptible

Sorry for the delay.

> On 06/04, Oleg Nesterov wrote:
> >
> > On 06/04, KOSAKI Motohiro wrote:
> > >
> > > In the multi-threaded OOM case, we have two problematic routines, coredump
> > > and vmscan. Roland's idea can only solve the former.
> > >
> > > But I'm also interested in vmscan exiting quickly once OOM is received.
> >
> > Yes, agreed. See another email from me, MMF_ flags looks "obviously
> > useful" to me.
>
> Well. But somehow we forgot about the !coredumping case... Suppose
> that select_bad_process() chooses the process P to kill and we have
> other processes (not sub-threads) which share the same ->mm.

Ah, yes. I think you are correct.

> In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
> that we kill P and after that the "out-of-memory" condition goes away.
> But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
> clear this flag?
>
> Perhaps something like below makes sense for now.

Probably this works. At least I don't find any problems.
But umm... Do you mean we can't implement per-process oom flags?

For example,
1) go back to implementing signal->oom_victim,
   because we are using SIGKILL for OOM and struct signal
   naturally represents the signal target.
2) mm->nr_oom_killed_task
   just avoid a simple flag; instead count the number of
   oom-killed tasks.

I think both avoid the problem you explained. Am I missing something?

But, again, I have no objection to your patch, because I really hope to
fix the coredump vs oom issue.

2010-06-13 15:55:40

Subject: Re: [PATCH] oom: Make coredump interruptible

On 06/13, KOSAKI Motohiro wrote:
>
> > On 06/04, Oleg Nesterov wrote:
> > >
> > Perhaps something like below makes sense for now.
>
> Probably, this works. at least I don't find any problems.
> But umm... Do you mean we can't implement per-process oom flags?

Sorry, can't understand what you mean.

> example,
> 1) back to implement signal->oom_victim
> because We are using SIGKILL for OOM and struct signal
> naturally represent signal target.

Yes, but if this process participates in the coredump, we should find
the right thread, or mark mm or mm->core_state.

In fact, I was never sure that oom-kill should kill the single process.
Perhaps it should kill all tasks using the same ->mm instead. But this
is another story.

> 2) mm->nr_oom_killed_task
> just avoid simple flag. instead counting number of tasks of
> oom-killed.

again, can't understand.

> I think both avoid your explained problem. Am I missing something?

I guess that I am missing something ;) Please clarify?

> But, again, I have no objection to your patch. because I really hope to
> fix coredump vs oom issue.

Yes, I think this is important. And if we keep the PF_EXITING check in
select_bad_process(), it should be fixed so that at least the coredump
can't fool it. And the "p != current" is obviously not right too.

I'll try to do something next week, the patches should be simple.

Oleg.

2010-06-13 17:15:22

Subject: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)

On 06/13, Oleg Nesterov wrote:
>
> On 06/13, KOSAKI Motohiro wrote:
> >
> > But, again, I have no objection to your patch. because I really hope to
> > fix coredump vs oom issue.
>
> Yes, I think this is important.

Oh. And another problem, vfork() is not interruptible too. This means
that the user can hide the memory hog from oom-killer. But let's forget
about oom.

Roland, any reason it should be uninterruptible? This doesn't look good
in any case. Perhaps the pseudo-patch below makes sense?

Oleg.

--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -1359,6 +1359,26 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }

+// ---------------------------------------------------
+// THIS SHOULD BE USED BY mm_release/coredump_wait/etc
+// ---------------------------------------------------
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork = xchg(&tsk->vfork_done, NULL);
+	if (vfork)
+		complete(vfork);
+}
+
+static void wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
+{
+	if (!wait_for_completion_killable(vfork))
+		return;
+	if (xchg(&child->vfork_done, NULL) != NULL)
+		return;
+	// the child has already read ->vfork_done and it should wake us up
+	wait_for_completion(vfork);
+}
+
 /*
  * Ok, this is the main fork-routine.
  *
@@ -1433,6 +1453,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}

 		audit_finish_fork(p);
@@ -1462,7 +1483,8 @@ long do_fork(unsigned long clone_flags,

 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
-			wait_for_completion(&vfork);
+			wait_for_vfork_done(p, &vfork);
+			put_task_struct(p);
 			freezer_count();
 			tracehook_report_vfork_done(p, nr);
 		}
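
The handoff in the pseudo-patch above hinges on both sides doing an atomic exchange on ->vfork_done, so that exactly one of them ever sees the non-NULL pointer. Below is a minimal userspace sketch of that idea, using C11 atomics in place of the kernel's xchg(). The struct and function names are invented for illustration, the plain `done` flag only stands in for a real completion, and nothing here is kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_sketch {
	bool done;
};

/* Hypothetical stand-in for the one task field that matters here. */
struct task_sketch {
	_Atomic(struct completion_sketch *) vfork_done;
};

/*
 * Child side: take the pointer exactly once and signal it.
 * Returns true if this call actually signalled the waiter.
 */
static bool complete_vfork_done_sketch(struct task_sketch *tsk)
{
	struct completion_sketch *vfork =
		atomic_exchange(&tsk->vfork_done, (struct completion_sketch *)NULL);

	if (!vfork)
		return false;
	vfork->done = true;
	return true;
}

/*
 * Parent side, after its killable wait was interrupted.
 * Returns true if the parent cleared the pointer first and may leave
 * immediately; false if the child already took it, in which case the
 * parent must still wait for the child to signal the completion.
 */
static bool parent_may_abandon_wait(struct task_sketch *child)
{
	return atomic_exchange(&child->vfork_done,
			       (struct completion_sketch *)NULL) != NULL;
}
```

Whichever side performs the exchange first gets the pointer; the other side gets NULL. That is why the parent can only return early when its own exchange comes back non-NULL.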

2010-06-14 00:27:15

Subject: Re: [PATCH] oom: Make coredump interruptible

> Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
> must not exit until core_state->dumper->thread returns from do_coredump().
> If nothing else, the dumper can use its task_struct and it relies on
> the stable core_thread->next list. And right now TASK_KILLABLE can't
> work anyway, it is possible that fatal_signal_pending() is true.

Yes, I was right to say this should be another thread. Let's not get into
all this right now. I think it is mostly orthogonal to the oom_kill issue.

Thanks,
Roland

2010-06-14 00:36:19

Subject: Re: [PATCH] oom: Make coredump interruptible

> > 1) go back to implementing signal->oom_victim
> > because we are using SIGKILL for OOM and struct signal
> > naturally represents the signal target.
>
> Yes, but if this process participates in the coredump, we should find
> the right thread, or mark mm or mm->core_state.
>
> In fact, I was never sure that oom-kill should kill the single process.
> Perhaps it should kill all tasks using the same ->mm instead. But this
> is another story.

Indeed. But as long as oom_kill acts on process granularity, I don't think
we should have it set an mm-granularity flag. That calculus changes if a
core dump is actually in progress, since that is already definitely going
to kill all tasks using that mm. When no dump is in progress, it feels
wrong to leave any state change in mm, since the other mm-sharers were not
affected.

Thanks,
Roland

2010-06-14 00:56:21

Subject: Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)

> Oh. And another problem, vfork() is not interruptible either. This means
> that the user can hide the memory hog from oom-killer.

I'm not sure there is really any danger like that, because of the
oom_kill_process "Try to kill a child first" logic. Eventually the vfork
child will be chosen and killed, and when it finally exits that will
release the vfork wait. So if the vfork parent is really the culprit,
it will then be subject to oom_kill_process sooner or later.

> But let's forget about oom.

Sure, but it reminds me to mention that vfork mm sharing is another reason
that having oom_kill set some persistent state in the mm seems wrong. If a
vfork child is chosen for oom_kill and killed, then it's possible that will
relieve the need (e.g. much memory was held indirectly via its fd table or
whatnot else that is not shared with the parent via mm). So once the child
is dead, there should not be any lingering bits in the parent's mm.

> Roland, any reason it should be uninterruptible? This doesn't look good
> in any case. Perhaps the pseudo-patch below makes sense?

I've long thought that we should make a vfork parent SIGKILL-able. (Of
course the vfork wait can't be made interruptible by other signals, since
it must never do anything userish like signal handler setup until the child
has died or exec'd.) I don't know off hand of any problem with your
straightforward change. But I don't have much confidence that there isn't
any strange gotcha waiting there due to some other kind of implicit
assumption about vfork parent blocks that we are overlooking at the moment.
So I wouldn't change this without more thorough auditing and thinking about
everything related to vfork.

Personally, what I've really been interested in is changing the vfork wait
to use some different kind of blocking entirely. My real motivation for
that is to let a vfork wait be morphed into and out of TASK_TRACED, so a
debugger can examine its registers and so forth. That would entail letting
the vfork/clone syscall return fully back to the asm level so it could stop
in a proper state some place like the syscall-exit or notify-resume points.
However, that has other wrinkles on machines like sparc and ia64, where
user_regset access can involve user memory access. Since we can't allow
those while the user memory is still shared with the child, it might not
really be practical at all.

Thanks,
Roland
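
The "Try to kill a child first" logic mentioned above can be modelled in a few lines. This is a simplified sketch, not the kernel's oom_kill_process(): the task structure and the `mm_id` field are hypothetical, and it assumes (as the follow-up reply in this thread points out) that children sharing the parent's ->mm are skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified task: only what the selection needs. */
struct task_sketch {
	int pid;
	int mm_id;				/* equal mm_id means a shared mm */
	const struct task_sketch *children;	/* array of direct children */
	size_t nr_children;
};

/*
 * Return the task a "child first" policy would pick for candidate p:
 * the first child that does not share p's mm, otherwise p itself.
 */
static const struct task_sketch *pick_victim(const struct task_sketch *p)
{
	for (size_t i = 0; i < p->nr_children; i++) {
		const struct task_sketch *c = &p->children[i];

		if (c->mm_id == p->mm_id)
			continue;	/* same mm, e.g. a vfork child: skipped */
		return c;
	}
	return p;
}
```

Under this model a vfork child, which shares its parent's mm, is never picked ahead of the parent, so the parent itself becomes the candidate.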

2010-06-14 17:35:24

Subject: Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)

On 06/13, Roland McGrath wrote:
>
> > Oh. And another problem, vfork() is not interruptible either. This means
> > that the user can hide the memory hog from oom-killer.
>
> I'm not sure there is really any danger like that, because of the
> oom_kill_process "Try to kill a child first" logic.

But note that oom_kill_process() doesn't kill the children with the
same ->mm. I never understood this code.

Anyway, I agree. Even if I am right, this is not a very serious problem
from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
itself.

> > But let's forget about oom.
>
> Sure, but it reminds me to mention that vfork mm sharing is another reason
> that having oom_kill set some persistent state in the mm seems wrong.

Yes, yes, this was already discussed a bit. Only if the core dump is in
progress can we touch ->mm or (probably better but needs a bit more locking)
mm->core_state to signal the coredumping thread and (perhaps) for something
else.

> > Roland, any reason it should be uninterruptible? This doesn't look good
> > in any case. Perhaps the pseudo-patch below makes sense?
>
> I've long thought that we should make a vfork parent SIGKILL-able.

Good ;)

> (Of
> course the vfork wait can't be made interruptible by other signals, since
> it must never do anything userish

Yes sure. That is why wait_for_completion_killable(), not _interruptible.
But I assume you didn't mean that only SIGKILL should interrupt the
parent, any sig_fatal() signal should.

> I don't know off hand of any problem with your
> straightforward change. But I don't have much confidence that there isn't
> any strange gotcha waiting there due to some other kind of implicit
> assumption about vfork parent blocks that we are overlooking at the moment.
> So I wouldn't change this without more thorough auditing and thinking about
> everything related to vfork.

Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
other CLONE_ flags... Probably we should mostly worry about vfork ==
CLONE_VM | CLONE_VFORK case.

Anyway. ->vfork_done is per-thread. This means that without any changes
do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
group exits/execs. Perhaps this means we shouldn't worry too much.

> Personally, what I've really been interested in is changing the vfork wait
> to use some different kind of blocking entirely. My real motivation for
> that is to let a vfork wait be morphed into and out of TASK_TRACED,

I see. I never thought about this, but I think you are right.

Hmm. Even without a debugger, the parent doesn't react to SIGSTOP. Say,

	#include <unistd.h>

	int main(void)
	{
		if (!vfork())
			pause();
		return 0;
	}

and ^Z obviously won't work. Not good.

This is not trivial, I guess. Needs thinking...

Oleg.
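
The difference between "killable" and "interruptible" rests on which signals count as fatal. A rough userspace approximation of that test follows, assuming the rule is "the handler is the default one, and the default action neither ignores nor stops". The helper names are made up for illustration, and this is a sketch of the idea rather than the kernel's sig_fatal() itself.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/*
 * Signals that are not fatal by default because they are ignored
 * (or, for SIGCONT, only resume the task).
 */
static bool default_ignored(int sig)
{
	return sig == SIGCONT || sig == SIGCHLD ||
	       sig == SIGWINCH || sig == SIGURG;
}

/* Signals whose default action is to stop the task. */
static bool default_stops(int sig)
{
	return sig == SIGSTOP || sig == SIGTSTP ||
	       sig == SIGTTIN || sig == SIGTTOU;
}

/*
 * Approximate "this signal would kill the task": the disposition is the
 * default one and the default action is neither "ignore" nor "stop".
 * handler_is_default says whether the task left the signal at SIG_DFL.
 */
static bool sig_fatal_sketch(int sig, bool handler_is_default)
{
	return handler_is_default &&
	       !default_ignored(sig) && !default_stops(sig);
}
```

With this rule SIGTERM at its default disposition would wake a killable sleeper, while SIGTERM with a handler installed, or SIGTSTP, would not.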

2010-06-14 19:17:47

Subject: Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)

> But note that oom_kill_process() doesn't kill the children with the
> same ->mm. I never understood this code.

Yes, odd. This is the first time I've really looked at oom_kill.

> Anyway, I agree. Even if I am right, this is not a very serious problem
> from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
> itself.

Agreed.

> Yes sure. That is why wait_for_completion_killable(), not _interruptible.

Right, your code was fine. I was just being pedantic for the record since
you said "interruptible" in the text.

> But I assume you didn't mean that only SIGKILL should interrupt the
> parent, any sig_fatal() signal should.

Yes.

> Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
> other CLONE_ flags... Probably we should mostly worry about vfork ==
> CLONE_VM | CLONE_VFORK case.

Yes. I hope it is fine to make clone refuse CLONE_VFORK set without
CLONE_VM in the future as a sanity check. I don't think any use of
CLONE_VFORK other than the actual vfork use is something we ever intended
to support.

> Anyway. ->vfork_done is per-thread. This means that without any changes
> do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
> group exits/execs. Perhaps this means we shouldn't worry too much.

You mean some other thread in the parent's group can run in user mode.
Yes. The real reason for the vfork wait is just that the parent/child will
share the user stack memory, so in practice it's fine if other threads with
other stacks are touching other memory (i.e. it's just the user's problem).

> Hmm. Even without a debugger, the parent doesn't react to SIGSTOP.

Yes. It's been a long time since I thought about the vfork stuff much.
But I now recall thinking about the SIGSTOP/SIGTSTP issue too. It does
seem bad. OTOH, it has lurked there for many years now without complaints.

Note that supporting stop/fatal signals in the normal way means that the
call has to return and pass the syscall-exit tracing point first. This
means a change in the order of events seen by a debugger. It also
complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
happen before syscall-exit or signal stuff is possible. For proper
stopping in the normal way, the vfork-wait would be restarted via
sys_restart_syscall or something. But the way that happens ordinarily is
to get all the way back to user mode and reenter with a normal syscall.
That doesn't touch the user stack itself, but it sure makes one nervous.
It's hard to see how we could ever do that and then prevent normal signals
from being handled before the restart. (Instead, we'd have the actual
blocking done inside get_signal_to_deliver so we just never get to user
mode until the vfork hold is released, and not actually need to restart.)
So there are multiple cans of worms cascading from a change, even though
the actual work to do the block in a new way might not be very complex.

It all seems kind of doable, at least if we accept a change in the userland
debugger experience of which ptrace reports a vfork parent might make in
what order. But plenty of hair to worry about.

Thanks,
Roland
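
The sanity check hoped for above (refusing CLONE_VFORK without CLONE_VM) would amount to a simple flag test. The sketch below is hypothetical: no such check is claimed to exist in the kernel, the function name is invented, and the constants are local copies of the Linux clone(2) flag values so that the example builds anywhere.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the Linux clone(2) flag bits, named to avoid clashes. */
#define SKETCH_CLONE_VM		0x00000100UL
#define SKETCH_CLONE_VFORK	0x00004000UL

/* Hypothetical check: accept CLONE_VFORK only together with CLONE_VM. */
static bool vfork_flags_sane(unsigned long clone_flags)
{
	if (!(clone_flags & SKETCH_CLONE_VFORK))
		return true;	/* no vfork semantics requested */
	return (clone_flags & SKETCH_CLONE_VM) != 0;
}
```

A caller would reject the request when the function returns false; plain vfork, which sets both bits, always passes.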

2010-06-28 17:35:40

Subject: Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)

On 06/14, Roland McGrath wrote:
>
> > Hmm. Even without a debugger, the parent doesn't react to SIGSTOP.
>
> Yes. It's been a long time since I thought about the vfork stuff much.
> But I now recall thinking about the SIGSTOP/SIGTSTP issue too. It does
> seem bad. OTOH, it has lurked there for many years now without complaints.
>
> Note that supporting stop/fatal signals in the normal way means that the
> call has to return and pass the syscall-exit tracing point first. This
> means a change in the order of events seen by a debugger. It also
> complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
> happen before syscall-exit or signal stuff is possible. For proper
> stopping in the normal way, the vfork-wait would be restarted via
> sys_restart_syscall or something.

Yes. I was thinking about this too.

The parent can play with real_blocked or saved_sigmask to block all
signals except STOP and KILL, use TASK_INTERRUPTIBLE for wait, and
just return ERESTART each time it gets the signal (it should clear
child->vfork_done if fatal_signal_pending).

We should also check PF_KTHREAD though, there are in-kernel users
of CLONE_VFORK.

> But the way that happens ordinarily is
> to get all the way back to user mode and reenter with a normal syscall.
> That doesn't touch the user stack itself, but it sure makes one nervous.

me too. Especially because I do not really know how !x86 machines
implement this all.

We should also verify that the exiting/stopping parent can never write
to its ->mm. For example, exit_mm() does put_user(tsk->clear_child_tid).
Fortunately we can rely on PF_SIGNALED flag in this case.

Oleg.
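
The "block all signals except STOP and KILL" mask described above can be illustrated with the POSIX sigset API. This is only a userspace analogue under stated assumptions: the kernel has its own sigset helpers, the function name is made up, and in userspace SIGKILL and SIGSTOP cannot be blocked anyway, so the sketch merely shows how such a set is built.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>

/*
 * Fill *mask with every signal except SIGKILL and SIGSTOP.
 * Returns 0 on success, -1 if any of the sigset calls fails.
 */
static int build_vfork_wait_mask(sigset_t *mask)
{
	if (sigfillset(mask) != 0)
		return -1;
	if (sigdelset(mask, SIGKILL) != 0 || sigdelset(mask, SIGSTOP) != 0)
		return -1;
	return 0;
}
```

A waiter sleeping with such a mask in effect would only be woken by a stop or a kill, which matches the behaviour the message asks for.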

2010-06-28 18:04:30