LinuxLists.cc - [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core

2008-07-16 13:51:30

Subject: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

With the previous changes the sub-threads which participate in coredump do
not need to have the valid ->mm when the coredump is in progress, now we
can decouple exit_mm() from coredumping code.

Change exit_mm() to clear ->mm first, then play with mm->core_state. This
simplifies the code because we can avoid unlock/lock games with ->mmap_sem,
and more importantly this makes the coredumping process visible to oom_kill.
Currently the PF_EXITING task can sleep with ->mm != NULL "unpredictably"
long.

The patch moves the coredumping code to the new function for the readability.

Signed-off-by: Oleg Nesterov <[email protected]>

kernel/exit.c | 47 +++++++++++++++++++++++++----------------------
fs/exec.c | 4 ++--
2 files changed, 27 insertions(+), 24 deletions(-)

--- 26-rc2/kernel/exit.c~7_CLEAR_MM_FIRST 2008-07-15 20:25:48.000000000 +0400
+++ 26-rc2/kernel/exit.c 2008-07-15 20:24:50.000000000 +0400
@@ -646,6 +646,28 @@ assign_new_owner:
}
#endif /* CONFIG_MM_OWNER */

+static void exit_coredump(struct task_struct * tsk,
+ struct core_state *core_state)
+{
+ struct core_thread self;
+
+ self.task = tsk;
+ self.next = xchg(&core_state->dumper.next, &self);
+ /*
+ * Implies mb(), the result of xchg() must be visible
+ * to core_state->dumper.
+ */
+ if (atomic_dec_and_test(&core_state->nr_threads))
+ complete(&core_state->startup);
+
+ for (;;) {
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ if (!self.task) /* see coredump_finish() */
+ break;
+ schedule();
+ }
+ __set_task_state(tsk, TASK_RUNNING);
+}
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -667,28 +689,6 @@ static void exit_mm(struct task_struct *
*/
down_read(&mm->mmap_sem);
core_state = mm->core_state;
- if (core_state) {
- struct core_thread self;
- up_read(&mm->mmap_sem);
-
- self.task = tsk;
- self.next = xchg(&core_state->dumper.next, &self);
- /*
- * Implies mb(), the result of xchg() must be visible
- * to core_state->dumper.
- */
- if (atomic_dec_and_test(&core_state->nr_threads))
- complete(&core_state->startup);
-
- for (;;) {
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!self.task) /* see coredump_finish() */
- break;
- schedule();
- }
- __set_task_state(tsk, TASK_RUNNING);
- down_read(&mm->mmap_sem);
- }
atomic_inc(&mm->mm_count);
BUG_ON(mm != tsk->active_mm);
/* more a memory barrier than a real lock */
@@ -701,6 +701,9 @@ static void exit_mm(struct task_struct *
task_unlock(tsk);
mm_update_next_owner(mm);
mmput(mm);
+
+ if (core_state)
+ exit_coredump(tsk, core_state);
}

static void
--- 26-rc2/fs/exec.c~7_CLEAR_MM_FIRST 2008-07-15 17:54:45.000000000 +0400
+++ 26-rc2/fs/exec.c 2008-07-15 20:24:50.000000000 +0400
@@ -1632,8 +1632,8 @@ static void coredump_finish(struct mm_st
next = curr->next;
task = curr->task;
/*
- * see exit_mm(), curr->task must not see
- * ->task == NULL before we read ->next.
+ * see exit_coredump(), curr->task must not
+ * see ->task == NULL before we read ->next.
*/
smp_mb();
curr->task = NULL;

2008-07-20 02:33:17

by Roland McGrath

[permalink] [raw]

Subject: Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

> With the previous changes the sub-threads which participate in coredump do
> not need to have the valid ->mm when the coredump is in progress, now we
> can decouple exit_mm() from coredumping code.

I'm all for separating the code more cleanly. But I don't think it can
work to change the order of the operations, i.e. it is not really true that
core dumps don't need each thread's ->mm link to be valid. Is there a
benefit to unlinking the mm before waiting for the core dump to finish?

The issue is that the user_regset calls to get "thread state" might
actually read some user memory. Those calls use a task_struct pointer and
you don't get to separately tell them the mm_struct describing the thread's
address space. For example, the sparc64 "general registers" note for core
files includes the register window read from user memory.

So, it's not OK to clear the ->mm before everything examining the thread's
machine state is really done, i.e. core dump and anything else.

Thanks,
Roland

2008-07-20 12:20:36

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

On 07/19, Roland McGrath wrote:
>
> > With the previous changes the sub-threads which participate in coredump do
> > not need to have the valid ->mm when the coredump is in progress, now we
> > can decouple exit_mm() from coredumping code.
>
> I'm all for separating the code more cleanly. But I don't think it can
> work to change the order of the operations, i.e. it is not really true that
> core dumps don't need each thread's ->mm link to be valid. Is there a
> benefit to unlinking the mm before waiting for the core dump to finish?

If select_bad_process() sees the PF_EXITING task with ->mm != NULL, it
returns ERR_PTR(-1). This means that any prcoess doing the mt coredump
blocks oom kill completely. It is not that oom_kill doesn't take this
process into account, oom_kill just can't work intil ->core_dump()
completes.

Yes, oom_kill.c in turn need fixes but still this is not nice, and I
personally hate this coredump code in the middle of exit_mm().

However,

> The issue is that the user_regset calls to get "thread state" might
> actually read some user memory. Those calls use a task_struct pointer and
> you don't get to separately tell them the mm_struct describing the thread's
> address space. For example, the sparc64 "general registers" note for core
> files includes the register window read from user memory.
>
> So, it's not OK to clear the ->mm before everything examining the thread's
> machine state is really done, i.e. core dump and anything else.

Oh, thanks Roland.

Andrew, please drop

coredump-binfmt_elf_fdpic-dont-use-sub-threads-mm.patch
coredump-exit_mm-clear-mm-first-then-play-with-core_state.patch

btw, arch/sparc64/kernel/ptrace.c has a lot of

if (target == current)
copy_xxx_user();
else
access_process_vm();

perhaps it make sense to make a helper. Just curious (I don't know what
regset is), is it possible that ->get() is called when target->mm == NULL?

Oleg.

2008-07-21 00:10:45

by Roland McGrath

[permalink] [raw]

Subject: Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

> Yes, oom_kill.c in turn need fixes but still this is not nice, and I
> personally hate this coredump code in the middle of exit_mm().

I agree. I think what we should work towards is having the coredump
synchronization take place earlier (the tracehook_report_exit point should
be fine, i.e. before PF_EXITING). Then have the dumping and the waiting
for it be killable. I don't think we can get there in only one or two
steps, though.

[The rest if quite off-topic, please do it separately.]

> btw, arch/sparc64/kernel/ptrace.c has a lot of
>
> if (target == current)
> copy_xxx_user();
> else
> access_process_vm();
>
> perhaps it make sense to make a helper.

Dave actually has get_from_target and set_to_target helpers for that.
The places they aren't used, I assume are either just older code not
yet streamlined, or places where the separate get/put_user calls perform
especially better than copy_xxx_user (you'd have to ask Dave). If multiple
arch ports find such helpers useful, they could move into common code later.

> Just curious (I don't know what regset is), is it possible that ->get()
> is called when target->mm == NULL?

It should not happen. It's only kosher to use user_regset calls on a task
in a known state (like ptrace stop, or current), and never on kernel threads.

Thanks,
Roland

2008-07-21 16:18:26

by Oleg Nesterov

[permalink] [raw]

Subject: killable/interruptible coredumps

On 07/20, Roland McGrath wrote:
>
> Then have the dumping and the waiting
> for it be killable.

I think it is easy to make the coredumping killable right now,

--- fs/exec.c~ 2008-07-21 19:47:22.000000000 +0400
+++ fs/exec.c 2008-07-21 19:56:44.000000000 +0400
@@ -1523,6 +1523,7 @@ static inline int zap_threads(struct tas
spin_lock_irq(&tsk->sighand->siglock);
if (!signal_group_exit(tsk->signal)) {
mm->core_state = core_state;
+ clear_thread_flag(TIF_SIGPENDING);
tsk->signal->group_exit_code = exit_code;
nr = zap_process(tsk);
}
@@ -1735,12 +1736,6 @@ int do_coredump(long signr, int exit_cod
goto fail;

/*
- * Clear any false indication of pending signals that might
- * be seen by the filesystem code called to write the core file.
- */
- clear_thread_flag(TIF_SIGPENDING);
-
- /*
* lock_kernel() because format_corename() is controlled by sysctl, which
* uses lock_kernel()
*/
--- fs/binfmt_elf.c~ 2008-07-13 20:52:25.000000000 +0400
+++ fs/binfmt_elf.c 2008-07-20 19:53:08.000000000 +0400
@@ -1105,11 +1105,17 @@ out:
*/
static int dump_write(struct file *file, const void *addr, int nr)
{
+ if (fatal_signal_pending(current))
+ return 0;
+
return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
}

static int dump_seek(struct file *file, loff_t off)
{
+ if (fatal_signal_pending(current))
+ return 0;
+
if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
if (file->f_op->llseek(file, off, SEEK_CUR) < 0)
return 0;

This relies on the fact SIGKILL will find the coredumping thread
because it is not PF_EXITING, even if SIGNAL_GROUP_EXIT is set. This
is a bit racy though, SIGKILL may come after zap_process(current)
but before other threads have "died".

I wonder if it makes sense to go futher. Currently the coredumping
thread sets SIGNAL_GROUP_EXIT. We can set signal->group_exit_task
instead, this way it can be killed by any fatal signal, not only by
SIGKILL. For example the user can just use ^C (unless it is ignored/etc).

What is your opinion about the patch below? I can't decide wether
this change is good or bad. It complicates the code, but otoh it
makes things more consistent, imho. Note also we can change oom_kill
to check SIGNAL_GROUP_EXIT instead of PF_EXITING to detect the process
which should release its ->mm and probably free the memory "soon".

It is not clear what should be ->group_exit_code if ->core_dump() is
interrupted. This patch sets it = exit_code, but perhaps we should
keep the original signr in that case.

Oleg.

--- 26-rc2/fs/exec.c~1_MAKE_KILLABLE 2008-07-20 17:31:40.000000000 +0400
+++ 26-rc2/fs/exec.c 2008-07-20 19:46:21.000000000 +0400
@@ -1498,7 +1498,6 @@ static int zap_process(struct task_struc
struct task_struct *t;
int nr = 0;

- start->signal->flags = SIGNAL_GROUP_EXIT;
start->signal->group_stop_count = 0;

t = start;
@@ -1523,7 +1522,8 @@ static inline int zap_threads(struct tas
spin_lock_irq(&tsk->sighand->siglock);
if (!signal_group_exit(tsk->signal)) {
mm->core_state = core_state;
- tsk->signal->group_exit_code = exit_code;
+ clear_thread_flag(TIF_SIGPENDING);
+ tsk->signal->group_exit_task = tsk;
nr = zap_process(tsk);
}
spin_unlock_irq(&tsk->sighand->siglock);
@@ -1574,6 +1574,7 @@ static inline int zap_threads(struct tas
if (p->mm) {
if (unlikely(p->mm == mm)) {
lock_task_sighand(p, &flags);
+ p->signal->flags = SIGNAL_GROUP_EXIT;
nr += zap_process(p);
unlock_task_sighand(p, &flags);
}
@@ -1619,7 +1620,7 @@ fail:
return core_waiters;
}

-static void coredump_finish(struct mm_struct *mm)
+static inline void core_state_finish(struct mm_struct *mm)
{
struct core_thread *curr, *next;
struct task_struct *task;
@@ -1640,6 +1641,17 @@ static void coredump_finish(struct mm_st
mm->core_state = NULL;
}

+static void coredump_finish(struct mm_struct *mm, int exit_code)
+{
+ spin_lock_irq(&current->sighand->siglock);
+ current->signal->flags = SIGNAL_GROUP_EXIT;
+ current->signal->group_exit_code = exit_code;
+ current->signal->group_exit_task = NULL;
+ spin_unlock_irq(&current->sighand->siglock);
+
+ core_state_finish(mm);
+}
+
/*
* set_dumpable converts traditional three-value dumpable to two flags and
* stores them into mm->flags. It modifies lower two bits of mm->flags, but
@@ -1735,12 +1747,6 @@ int do_coredump(long signr, int exit_cod
goto fail;

/*
- * Clear any false indication of pending signals that might
- * be seen by the filesystem code called to write the core file.
- */
- clear_thread_flag(TIF_SIGPENDING);
-
- /*
* lock_kernel() because format_corename() is controlled by sysctl, which
* uses lock_kernel()
*/
@@ -1814,9 +1820,8 @@ int do_coredump(long signr, int exit_cod
goto close_fail;

retval = binfmt->core_dump(signr, regs, file, core_limit);
-
if (retval)
- current->signal->group_exit_code |= 0x80;
+ exit_code |= 0x80;
close_fail:
filp_close(file, NULL);
fail_unlock:
@@ -1824,7 +1829,7 @@ fail_unlock:
argv_free(helper_argv);

current->fsuid = fsuid;
- coredump_finish(mm);
+ coredump_finish(mm, exit_code);
fail:
return retval;
}