2007-11-27 13:20:36

by Andrea Arcangeli

Subject: /proc dcache deadlock in do_exit

Hi,

this patch fixes a sles9 system hang in start_this_handle from a
customer with some heavy workload, where all tasks are waiting on
kjournald to commit the transaction, but kjournald waits for t_updates
to go down to zero (and it never does). This was reported as a lowmem
shortage deadlock, but when checking the debug data I noticed the VM
wasn't under pressure at all (well, it really was under VM pressure,
because lots of tasks hung in the VM prune_dcache methods trying to
flush dirty inodes, but no task was hanging in GFP_NOFS mode, which is
where the holder of the journal handle should have been if this were a
VM issue in the first place). No task was apparently holding the
leftover handle in the committing transaction, so I deduced t_updates
was stuck at 1 because a journal_stop was never run by some path (this
turned out to be correct). With a debug patch adding proper reverse
links and stack trace logging to ext3 deployed in production, I found
journal_stop is never run because mark_inode_dirty_sync is called
inside release_task, called by do_exit. (That was quite fun, because I
would never have thought of this subtlety; I assumed a regular path in
ext3 had a bug and forgot to call journal_stop.)

do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)
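
Spelled out a bit more (the intermediate frames are my reconstruction
from the debug data and from reading the sles9 code, so treat them as
approximate rather than authoritative):

	do_exit()
	  release_task()              self-reaping task, nobody will wait
	    proc_pid_flush()          (proc_flush_task() in current mainline)
	      shrink_dcache_parent()  on the /proc/<pid> dentry
	        prune_dcache()        prunes from the global unused-dentry
	                              LRU, so it can hit dentries of other
	                              filesystems too
	          dput() -> iput()    on a dirty ext3 inode
	            mark_inode_dirty_sync()  the sles9 I_DIRTY_DELAYED path;
	                                     ext3 takes a journal handle here
	              schedule()      never comes back, so the matching
	                              journal_stop() never runs and
	                              t_updates stays stuck at 1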

The reason is that shrink_dcache_parent is racy by design (a feature,
not a bug) and it can do blocking I/O in some cases, but the point is
that calling shrink_dcache_parent at the last stage of do_exit isn't
safe for self-reaping tasks.

I guess the memory pressure on the unbalanced highmem system made this
easier to trigger.

Now, mainline doesn't have these lines in iput (as sles9 has):

	if (inode->i_state & I_DIRTY_DELAYED)
		mark_inode_dirty_sync(inode);

so it will probably not crash with ext3, but ext2, for example,
implements an I/O-blocking ext2_put_inode that will lead to a similar
screwup, with ext2_free_blocks never coming back (see the rough call
path sketched after this paragraph), and it's definitely wrong to call
blocking-I/O paths inside do_exit. So this should fix a subtle bug in
mainline too (not verified in practice, though). The equivalent fix for
ext3 is also not yet verified to fix the problem on sles9, but I have
no doubt it will (it usually takes days to crash, so it'll take weeks
to be sure).
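
For ext2 the blocking path I have in mind is roughly the following
(from a quick read of the code of that era, so the middle frame is
approximate):

	iput()
	  ext2_put_inode()            the ->put_inode hook, run on every iput
	    ... discard of the inode's preallocation ...
	      ext2_free_blocks()      reads and dirties on-disk block
	                              bitmaps, so it can block on I/O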

An alternative fix would be to offload that work to a kernel thread,
but I don't think a reschedule for this is worth it; the VM should be
able to collect those entries later without the synchronous shrink in
release_task.
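
If someone really wanted the offload it would look roughly like the
sketch below (purely hypothetical, with made-up names, just to show the
alternative I'm dismissing; workqueue API details differ between kernel
versions):

/*
 * Hypothetical sketch only: defer the /proc dcache shrink to a
 * workqueue instead of doing it from the exiting task itself.
 */
struct proc_shrink_work {
	struct work_struct work;
	struct dentry *dentry;
};

static void proc_shrink_fn(struct work_struct *w)
{
	struct proc_shrink_work *psw =
		container_of(w, struct proc_shrink_work, work);

	shrink_dcache_parent(psw->dentry);	/* blocking is fine here */
	dput(psw->dentry);			/* drop the extra reference */
	kfree(psw);
}

/* ... and in proc_flush_task_mnt(), instead of shrinking inline: */
	psw = kmalloc(sizeof(*psw), GFP_ATOMIC);  /* avoid entering reclaim from do_exit */
	if (psw) {
		psw->dentry = dget(dentry);	/* extra ref for the worker */
		INIT_WORK(&psw->work, proc_shrink_fn);
		schedule_work(&psw->work);
	}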

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2265,7 +2265,8 @@ static void proc_flush_task_mnt(struct v
 	name.len = snprintf(buf, sizeof(buf), "%d", pid);
 	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
 	if (dentry) {
-		shrink_dcache_parent(dentry);
+		if (!(current->flags & PF_EXITING))
+			shrink_dcache_parent(dentry);
 		d_drop(dentry);
 		dput(dentry);
 	}


Here is the debugging code I wrote to track this deadlock down, if
anyone is interested. (I should really have used unsigned long _stack
to make life easier.)

Index: sles9/fs/jbd/transaction.c
--- sles9/fs/jbd/transaction.c.~1~	2007-10-26 00:18:21.000000000 +0200
+++ sles9/fs/jbd/transaction.c	2007-11-19 19:49:30.000000000 +0100
@@ -222,7 +222,19 @@ repeat_locked:
 	handle->h_transaction = transaction;
 	transaction->t_outstanding_credits += nblocks;
 	transaction->t_updates++;
-	transaction->t_handle_count++;
+	handle->id = transaction->t_handle_count++;
+	if (handle->id < NR_HANDLES)
+		transaction->handles[handle->id] = handle;
+	{
+		char _stack;
+		char * start = &_stack, * end, * stack;
+		end = (char *) (((unsigned long) start+THREAD_SIZE-1) & (-THREAD_SIZE));
+		stack = (char *) ((unsigned long) start + 2048);
+		if (stack > end)
+			stack = end;
+		memcpy(handle->stack, start, stack-start);
+	}
+
 	jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
 		  handle, nblocks, transaction->t_outstanding_credits,
 		  __log_space_left(journal));
@@ -398,6 +410,8 @@ int journal_restart(handle_t *handle, in
 	spin_lock(&transaction->t_handle_lock);
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
+	if (handle->id < NR_HANDLES)
+		transaction->handles[handle->id] = NULL;
 
 	if (!transaction->t_updates)
 		wake_up(&journal->j_wait_updates);
@@ -1362,6 +1376,8 @@ int journal_stop(handle_t *handle)
 	spin_lock(&transaction->t_handle_lock);
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
+	if (handle->id < NR_HANDLES)
+		transaction->handles[handle->id] = NULL;
 	if (!transaction->t_updates) {
 		wake_up(&journal->j_wait_updates);
 		if (journal->j_barrier_count)
Index: sles9/include/linux/jbd.h
--- sles9/include/linux/jbd.h.~1~	2007-10-26 00:18:21.000000000 +0200
+++ sles9/include/linux/jbd.h	2007-11-19 19:40:34.000000000 +0100
@@ -386,6 +386,8 @@ struct handle_s
 {
 	/* Which compound transaction is this update a part of? */
 	transaction_t *h_transaction;
+	int id;
+	char stack[2048];
 
 	/* Number of remaining buffers we are allowed to dirty: */
 	int h_buffer_credits;
@@ -569,6 +571,8 @@ struct transaction_s
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
+#define NR_HANDLES 400
+	handle_t * handles[NR_HANDLES];
 
 	/*
 	 * Protects the callback list
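
(Not part of the patch: just a tiny userspace illustration of what the
bounding arithmetic above does, i.e. copy at most 2048 bytes and never
past the top of the THREAD_SIZE-aligned stack area. THREAD_SIZE here is
an arbitrary stand-in; the real value depends on the arch and config.)

#include <stdio.h>

#define THREAD_SIZE 8192UL	/* stand-in value for the demo */

int main(void)
{
	char _stack;
	unsigned long start = (unsigned long) &_stack;
	/* round up to the next THREAD_SIZE boundary: in the kernel the
	 * stack area is THREAD_SIZE-aligned, so this is its top */
	unsigned long end = (start + THREAD_SIZE - 1) & ~(THREAD_SIZE - 1);
	unsigned long stop = start + 2048;

	if (stop > end)
		stop = end;
	printf("start %#lx, stack top %#lx, would copy %lu bytes\n",
	       start, end, stop - start);
	return 0;
}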


2007-11-27 22:39:18

by Andrew Morton

Subject: Re: /proc dcache deadlock in do_exit

On Tue, 27 Nov 2007 14:20:22 +0100
Andrea Arcangeli <[email protected]> wrote:

> Hi,
>
> this patch fixes a sles9 system hang in start_this_handle from a
> customer with some heavy workload where all tasks are waiting on
> kjournald to commit the transaction, but kjournald waits on t_updates
> to go down to zero (it never does). This was reported as a lowmem
> shortage deadlock but when checking the debug data I noticed the VM
> wasn't under pressure at all (well it was really under vm pressure,
> because lots of tasks hanged in the VM prune_dcache methods trying to
> flush dirty inodes, but no task was hanging in GFP_NOFS mode, the
> holder of the journal handle should have if this was a vm issue in the
> first place). No task was apparently holding the leftover handle in
> the committing transaction, so I deduced t_updates was stuck to 1
> because a journal_stop was never run by some path (this turned out to
> be correct). With a debug patch adding proper reverse links and stack
> trace logging in ext3 deployed in production, I found journal_stop is
> never run because mark_inode_dirty_sync is called inside release_task
> called by do_exit. (that was quite fun because I would have never
> thought about this subtleness, I thought a regular path in ext3 had a
> bug and it forgot to call journal_stop)
>
> do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> come back to run journal_stop)

I don't see why the schedule() will not return? Because the task has
PF_EXITING set? Doesn't TASK_DEAD do that?

> The reason is that shrink_dcache_parent is racy by design (feature not
> a bug) and it can do blocking I/O in some case, but the point is that
> calling shrink_dcache_parent at the last stage of do_exit isn't safe
> for self-reaping tasks.
>
> I guess the memory pressure of the unbalanced highmem system allowed
> to trigger this more easily.
>
> Now mainline doesn't have this line in iput (like sles9 has):
>
> if (inode->i_state & I_DIRTY_DELAYED)
> mark_inode_dirty_sync(inode);
>
> so it will probably not crash with ext3, but for example ext2
> implements an I/O-blocking ext2_put_inode that will lead to similar
> screwups with ext2_free_blocks never coming back and it's definitely
> wrong to call blocking-IO paths inside do_exit. So this should fix a
> subtle bug in mainline too (not verified in practice though). The
> equivalent fix for ext3 is also not verified yet to fix the problem in
> sles9 but I don't have doubt it will (it usually takes days to crash,
> so it'll take weeks to be sure).
>
> An alternate fix would be to offload that work to a kernel thread, but
> I don't think a reschedule for this is worth it, the vm should be able
> to collect those entries for the synchronous release_task.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2265,7 +2265,8 @@ static void proc_flush_task_mnt(struct v
> name.len = snprintf(buf, sizeof(buf), "%d", pid);
> dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> if (dentry) {
> - shrink_dcache_parent(dentry);
> + if (!(current->flags & PF_EXITING))
> + shrink_dcache_parent(dentry);
> d_drop(dentry);
> dput(dentry);
> }

What are the implications of not running shrink_dcache_parent() on the exit
path sometimes? We'll leave procfs stuff behind? Will they be reaped by
memory pressure later on?


2007-11-28 01:21:43

by Andrea Arcangeli

Subject: Re: /proc dcache deadlock in do_exit

On Tue, Nov 27, 2007 at 02:38:52PM -0800, Andrew Morton wrote:
> I don't see why the schedule() will not return? Because the task has
> PF_EXITING set? Doesn't TASK_DEAD do that?

Ouch, I assumed you couldn't safely sleep anymore in release_task,
given it's the function that frees the task structure itself and there
was no preempt-related action anywhere close to it!
delayed_put_task_struct can be called once a quiescent point is
reached, and any scheduling would be exactly what allows it to run (it
requires quite a bit of a race, with a local irq triggering a
reschedule and the timer irq invoking the tasklet that frees the task
struct before do_exit finishes, with all the other cpus in a quiescent
state too).

So a corollary question is: how can it be safe to call
preempt_disable() after call_rcu(delayed_put_task_struct)?

Back in sles9 preempt_disable was implemented as
_raw_write_unlock(&tasklist_lock) and it happened _before_
release_task, and scheduling there wouldn't return because PF_DEAD was
already set. If mainline can come back, it will crash for a different
reason, because the task struct is long gone by the time
release_task+schedule() runs. Either way, there is still a
kernel-crashing bug there. Or is there some magic that prevents
call_rcu + schedule from invoking the rcu callback?

So you may need to apply this one too (this one is needed to fix the
second bug, my previous patch is needed after applying this one):

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -841,6 +841,13 @@ static void exit_notify(struct task_stru
 
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * Task struct can go away at the first schedule if this was a
+	 * self reaping task. Scheduling is forbidden until we set
+	 * the state to TASK_DEAD.
+	 */
+	preempt_disable();
+
 	/* If the process is dead, release it - nobody will wait for it */
 	if (state == EXIT_DEAD)
 		release_task(tsk);
@@ -1042,7 +1049,6 @@ fastcall NORET_TYPE void do_exit(long co
 	if (tsk->splice_pipe)
 		__free_pipe_info(tsk->splice_pipe);
 
-	preempt_disable();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;

> What are the implications of not running shrink_dcache_parent() on
> the exit path sometimes? We'll leave procfs stuff behind? Will
> they be reaped by memory pressure later on?

Yes.

2007-11-28 01:44:43

by Eric W. Biederman

Subject: Re: /proc dcache deadlock in do_exit

Andrew Morton <[email protected]> writes:

> On Tue, 27 Nov 2007 14:20:22 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
>> Hi,
>>
>> this patch fixes a sles9 system hang in start_this_handle from a
>> customer with some heavy workload where all tasks are waiting on
>> kjournald to commit the transaction, but kjournald waits on t_updates
>> to go down to zero (it never does). This was reported as a lowmem
>> shortage deadlock but when checking the debug data I noticed the VM
>> wasn't under pressure at all (well it was really under vm pressure,
>> because lots of tasks hanged in the VM prune_dcache methods trying to
>> flush dirty inodes, but no task was hanging in GFP_NOFS mode, the
>> holder of the journal handle should have if this was a vm issue in the
>> first place). No task was apparently holding the leftover handle in
>> the committing transaction, so I deduced t_updates was stuck to 1
>> because a journal_stop was never run by some path (this turned out to
>> be correct). With a debug patch adding proper reverse links and stack
>> trace logging in ext3 deployed in production, I found journal_stop is
>> never run because mark_inode_dirty_sync is called inside release_task
>> called by do_exit. (that was quite fun because I would have never
>> thought about this subtleness, I thought a regular path in ext3 had a
>> bug and it forgot to call journal_stop)
>>
>> do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
>> come back to run journal_stop)
>
> I don't see why the schedule() will not return? Because the task has
> PF_EXITING set? Doesn't TASK_DEAD do that?

Yes, why do we not come back from schedule?

If we are not allowed to schedule after setting PF_EXITING but before
we set TASK_DEAD, that entire code path sounds brittle and
error-prone.


> What are the implications of not running shrink_dcache_parent() on the exit
> path sometimes? We'll leave procfs stuff behind? Will they be reaped by
> memory pressure later on?

It should. I think the reaping is just an optimization, because we
know we will never need those dentries again, though they can be
pinned by open directories or open files. What I don't know off the
top of my head is whether there is a d_drop equivalent going on that
might be a problem if we don't address it.

Eric

2007-11-28 01:45:29

by Andrea Arcangeli

Subject: Re: /proc dcache deadlock in do_exit

On Wed, Nov 28, 2007 at 02:21:29AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 27, 2007 at 02:38:52PM -0800, Andrew Morton wrote:
> > I don't see why the schedule() will not return? Because the task has
> > PF_EXITING set? Doesn't TASK_DEAD do that?
>
> Ouch, I assumed you couldn't sleep safely anymore in release_task
> given it's the function that will free the task structure itself and
> there was no preempt related action anywhere close to it!
> delayed_put_task_struct can be called if a quiescent point is reached
> and any scheduling would exactly allow it to run (it requires quite a
> bit of a race, with local irq triggering a reschedule and the timer
> irq invoking the tasklet to run to free the task struct before do_exit
> finishes and all other cpus in quiescent state too).
>
> So a corollary question is how can it be safe to call
> preempt_disable() after call_rcu(delayed_put_task_struct)?
>
> Back in sles9 preempt_disable was implemented as
> _raw_write_unlock(&tasklist_lock) and it happened _before_
> release_task, and scheduling there wouldn't return because PF_DEAD was
> already set. If mainline can come back, it will crash for a different
> reason because the task struct is long gone by the time
> release_task+schedule() runs. Either ways, still a kernel crashing bug
> there is. Or is there some magic that prevents call_rcu + schedule to
> invoke the rcu callback?
>
> So you may need to apply this one too (this one is needed to fix the
> second bug, my previous patch is needed after applying this one):

Thinking about what happened once already, I think this would be more
debuggable, but maybe not... dunno.

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -841,6 +841,14 @@ static void exit_notify(struct task_stru
 
 	write_unlock_irq(&tasklist_lock);
 
+	/*
+	 * Task struct can go away at the first schedule if this was a
+	 * self reaping task after calling release_task. Scheduling is
+	 * forbidden until do_exit finishes.
+	 */
+	preempt_disable();
+	tsk->state = TASK_DEAD;
+
 	/* If the process is dead, release it - nobody will wait for it */
 	if (state == EXIT_DEAD)
 		release_task(tsk);
@@ -1042,10 +1050,7 @@ fastcall NORET_TYPE void do_exit(long co
 	if (tsk->splice_pipe)
 		__free_pipe_info(tsk->splice_pipe);
 
-	preempt_disable();
 	/* causes final put_task_struct in finish_task_switch(). */
-	tsk->state = TASK_DEAD;
-
 	schedule();
 	BUG();
 	/* Avoid "noreturn function does return". */

2007-11-28 01:55:01

by Eric W. Biederman

Subject: Re: /proc dcache deadlock in do_exit

Andrea Arcangeli <[email protected]> writes:

> On Tue, Nov 27, 2007 at 02:38:52PM -0800, Andrew Morton wrote:
>> I don't see why the schedule() will not return? Because the task has
>> PF_EXITING set? Doesn't TASK_DEAD do that?
>
> Ouch, I assumed you couldn't sleep safely anymore in release_task
> given it's the function that will free the task structure itself and
> there was no preempt related action anywhere close to it!
> delayed_put_task_struct can be called if a quiescent point is reached
> and any scheduling would exactly allow it to run (it requires quite a
> bit of a race, with local irq triggering a reschedule and the timer
> irq invoking the tasklet to run to free the task struct before do_exit
> finishes and all other cpus in quiescent state too).
>
> So a corollary question is how can it be safe to call
> preempt_disable() after call_rcu(delayed_put_task_struct)?


> Back in sles9 preempt_disable was implemented as
> _raw_write_unlock(&tasklist_lock) and it happened _before_
> release_task, and scheduling there wouldn't return because PF_DEAD was
> already set. If mainline can come back, it will crash for a different
> reason because the task struct is long gone by the time
> release_task+schedule() runs. Either ways, still a kernel crashing bug
> there is. Or is there some magic that prevents call_rcu + schedule to
> invoke the rcu callback?

I don't quite see where it comes from, but there is another reference
on the task struct, held by the scheduler, that we don't drop until
finish_task_switch with exit_state == TASK_DEAD.

Since we have that additional reference, the task_struct won't get
freed.

We don't set TASK_DEAD until much later.
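
Roughly (paraphrased sketch of the scheduler of that era, written from
memory rather than copied from the source):

/*
 * Paraphrased, not verbatim: the scheduler keeps its own reference to
 * the task_struct and only drops it after the dead task has scheduled
 * away for the last time, so release_task() + call_rcu() cannot free
 * the struct while the task is still running its own exit path.
 */
static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
	long prev_state = prev->state;

	/* ... */
	if (unlikely(prev_state == TASK_DEAD)) {
		/* ... */
		put_task_struct(prev);	/* drop the scheduler's reference */
	}
}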

> So you may need to apply this one too (this one is needed to fix the
> second bug, my previous patch is needed after applying this one):

No. We should be fine.

In fact it looks like this is just a sles9 issue and mainline is
fine, as we can safely schedule until just before the end of do_exit.

Eric

2007-11-28 02:08:35

by Eric W. Biederman

Subject: Re: /proc dcache deadlock in do_exit

Andrea Arcangeli <[email protected]> writes:

>>
>> So you may need to apply this one too (this one is needed to fix the
>> second bug, my previous patch is needed after applying this one):
>
> thinking what happened once already, I think this would be more
> debuggable but maybe not... dunno.

Please god no. The inability to schedule is a bug that apparently was
fixed after sles9 forked off.

In mainline it is safe to sleep until the very end of do_exit, and
your suggested patch breaks that.

None of your patches are necessary for mainline.

I believe the commit below is the one that really fixed this issue,
although it may have been something earlier, and there have been some
significant cleanups since then.

> commit 00732b345b940de766fefaf1297ce26a67bdcea9
> Author: mingo <mingo>
> Date: Tue Oct 19 06:12:06 2004 +0000
>
> [PATCH] fix & clean up zombie/dead task handling & preemption
>
> This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
> had. Right now, the moment procfs does a down() that sleeps in
> proc_pid_flush() [it could] our TASK_DEAD state is zapped and we might be
> back to TASK_RUNNING to and we trigger this assert:
>
> schedule();
> BUG();
> /* Avoid "noreturn function does return". */
> for (;;) ;
>
> I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
> field, to allow the detaching of exit-signal/parent/wait-handling from
> descheduling a dead task. Dead-task freeing is done via PF_DEAD.
>
> Tested the patch on x86 SMP and UP, but all architectures should work
> fine.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> BKrev: 4174b036IkpU2-XeLFkjqH681u4Pyg
>

Eric

2007-11-28 02:11:25

by Eric W. Biederman

Subject: Re: /proc dcache deadlock in do_exit


Nacked-by: "Eric W. Biederman" <[email protected]>

Andrew Morton <[email protected]> writes:

> I don't see why the schedule() will not return? Because the task has
> PF_EXITING set? Doesn't TASK_DEAD do that?

This appears to be a workaround for an old bug only present in sles9.

It looks like it has been safe to schedule in release_task for years
in mainline, and that is simply much more robust.

Eric

2007-11-28 16:31:42

by Oleg Nesterov

Subject: Re: /proc dcache deadlock in do_exit

On 11/27, Eric W. Biederman wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > On Tue, 27 Nov 2007 14:20:22 +0100
> > Andrea Arcangeli <[email protected]> wrote:
> >
> >> do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> >> come back to run journal_stop)
> >
> > I don't see why the schedule() will not return? Because the task has
> > PF_EXITING set? Doesn't TASK_DEAD do that?
>
> Yes, why do we not come back from schedule?
>
> If we are not allowed to schedule after setting PF_EXITING before
> we set TASK_DEAD that entire code path sounds brittle and
> error prone.

Yes, it is fine to schedule after release_task(). As Eric pointed out,
we don't race with call_rcu(delayed_put_task_struct); the scheduler has
another reference:

dup_task_struct:
	/* One for us, one for whoever does the "release_task()" (usually parent) */
	atomic_set(&tsk->usage,2);



However, with CONFIG_HOTPLUG_CPU we do have a problem here, but this is
off-topic. Preemption is fine, but deactivate_task() is not. We can't
migrate the deactivated released task from the dead CPU.

migrate_live_tasks() can't find the task after __unhash_process().

migrate_dead_tasks() doesn't see it after deactivate_task().

And afaics try_to_wake_up() doesn't necessarily change task_cpu() if it
is offline.

No? But again, this is off-topic even if I am right.

Oleg.