2005-04-18 19:38:44

by linas

[permalink] [raw]
Subject: PATCH [PPC64]: dead processes never reaped



Hi,

The patch below appears to fix a problem where a number of dead processes
linger on the system. On a highly loaded system, dozens of processes
were found stuck in do_exit(), calling thier very last schedule(), and
then being lost forever.

Processes that are PF_DEAD are cleaned up *after* the context switch,
in a routine called finish_task_switch(task_t *prev). The "prev" gets
the value returned by _switch() in entry.S, but this value comes from

__switch_to (struct task_struct *prev,
struct task_struct *new)
{
old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
last = _switch(old_thread, new_thread);
return last;
}

The way I see it, "prev" and "current" are almost always going to be
pointing at the same thing; however, if a "need resched" happens,
or there's a pre-emept or some-such, then prev and current won't be
the same; in which case, finish_task_switch() will end up cleaning
up the old current, instead of prev. This will result in dead processes
hanging around, which will never be scheduled again, and will never
get a chance to have put_task_struct() called on them.

This patch fixes this.

Signed-off-by: Linas Vepstas <[email protected]>


--- arch/ppc64/kernel/process.c.orig 2005-04-18 14:26:42.000000000 -0500
+++ arch/ppc64/kernel/process.c 2005-04-18 14:27:54.000000000 -0500
@@ -204,7 +204,7 @@ struct task_struct *__switch_to(struct t
flush_tlb_pending();

new_thread = &new->thread;
- old_thread = &current->thread;
+ old_thread = &prev->thread;

local_irq_save(flags);
last = _switch(old_thread, new_thread);


2005-04-19 01:03:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
>
> Hi,
>
> The patch below appears to fix a problem where a number of dead processes
> linger on the system. On a highly loaded system, dozens of processes
> were found stuck in do_exit(), calling thier very last schedule(), and
> then being lost forever.
>
> Processes that are PF_DEAD are cleaned up *after* the context switch,
> in a routine called finish_task_switch(task_t *prev). The "prev" gets
> the value returned by _switch() in entry.S, but this value comes from
>
> __switch_to (struct task_struct *prev,
> struct task_struct *new)
> {
> old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
> last = _switch(old_thread, new_thread);
> return last;
> }
>
> The way I see it, "prev" and "current" are almost always going to be
> pointing at the same thing; however, if a "need resched" happens,
> or there's a pre-emept or some-such, then prev and current won't be
> the same; in which case, finish_task_switch() will end up cleaning
> up the old current, instead of prev. This will result in dead processes
> hanging around, which will never be scheduled again, and will never
> get a chance to have put_task_struct() called on them.

I wonder why we bother doing all that at all... we could just return
"prev" from __switch_to() no ? Like x86 does...

Ben.


2005-04-19 01:09:16

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
>
> Hi,
>
> The patch below appears to fix a problem where a number of dead processes
> linger on the system. On a highly loaded system, dozens of processes
> were found stuck in do_exit(), calling thier very last schedule(), and
> then being lost forever.
>
> Processes that are PF_DEAD are cleaned up *after* the context switch,
> in a routine called finish_task_switch(task_t *prev). The "prev" gets
> the value returned by _switch() in entry.S, but this value comes from
>
> __switch_to (struct task_struct *prev,
> struct task_struct *new)
> {
> old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
> last = _switch(old_thread, new_thread);
> return last;
> }
>
> The way I see it, "prev" and "current" are almost always going to be
> pointing at the same thing; however, if a "need resched" happens,
> or there's a pre-emept or some-such, then prev and current won't be
> the same; in which case, finish_task_switch() will end up cleaning
> up the old current, instead of prev. This will result in dead processes
> hanging around, which will never be scheduled again, and will never
> get a chance to have put_task_struct() called on them.

Ok, thinking moer about this ... that will need maybe some help from
Ingo so I fully understand where schedule's are allowed ... We are
basically in the middle of the scheduler here, so I wonder how much of
the scheduler itself can be preempted or so ...

Basically, under which circumstances can prev and current be different ?

Ben.


2005-04-19 01:25:34

by Nick Piggin

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Tue, 2005-04-19 at 11:07 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
> >
> > Hi,
> >
> > The patch below appears to fix a problem where a number of dead processes
> > linger on the system. On a highly loaded system, dozens of processes
> > were found stuck in do_exit(), calling thier very last schedule(), and
> > then being lost forever.
> >
> > Processes that are PF_DEAD are cleaned up *after* the context switch,
> > in a routine called finish_task_switch(task_t *prev). The "prev" gets
> > the value returned by _switch() in entry.S, but this value comes from
> >
> > __switch_to (struct task_struct *prev,
> > struct task_struct *new)
> > {
> > old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
> > last = _switch(old_thread, new_thread);
> > return last;
> > }
> >
> > The way I see it, "prev" and "current" are almost always going to be
> > pointing at the same thing; however, if a "need resched" happens,
> > or there's a pre-emept or some-such, then prev and current won't be
> > the same; in which case, finish_task_switch() will end up cleaning
> > up the old current, instead of prev. This will result in dead processes
> > hanging around, which will never be scheduled again, and will never
> > get a chance to have put_task_struct() called on them.
>
> Ok, thinking moer about this ... that will need maybe some help from
> Ingo so I fully understand where schedule's are allowed ... We are
> basically in the middle of the scheduler here, so I wonder how much of
> the scheduler itself can be preempted or so ...
>

Not much. schedule() has a small preempt window at the beginning
and end of the function.

The context switch is of course run with preempt disabled. Ie.
your switch_to should never get preempted.

> Basically, under which circumstances can prev and current be different ?
>

Depends on your context switch, really. prev == current before you
switch, and when you switch to 'new' it is different. However, I think
the 'new' current has *its* old prev on the stack (which == new
current). You just have to preserve the old 'prev' somehow (ie. the
thread you switched away from).

--
SUSE Labs, Novell Inc.


2005-04-19 17:22:38

by linas

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Tue, Apr 19, 2005 at 11:07:01AM +1000, Benjamin Herrenschmidt was heard to remark:
> On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
> >
> > Hi,
> >
> > The patch below appears to fix a problem where a number of dead processes
> > linger on the system. On a highly loaded system, dozens of processes
> > were found stuck in do_exit(), calling thier very last schedule(), and
> > then being lost forever.
> >
> > Processes that are PF_DEAD are cleaned up *after* the context switch,
> > in a routine called finish_task_switch(task_t *prev). The "prev" gets
> > the value returned by _switch() in entry.S, but this value comes from
> >
> > __switch_to (struct task_struct *prev,
> > struct task_struct *new)
> > {
> > old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
> > last = _switch(old_thread, new_thread);
> > return last;
> > }
> >
> > The way I see it, "prev" and "current" are almost always going to be
> > pointing at the same thing; however, if a "need resched" happens,
> > or there's a pre-emept or some-such, then prev and current won't be
> > the same; in which case, finish_task_switch() will end up cleaning
> > up the old current, instead of prev. This will result in dead processes
> > hanging around, which will never be scheduled again, and will never
> > get a chance to have put_task_struct() called on them.
>
> Ok, thinking moer about this ... that will need maybe some help from
> Ingo so I fully understand where schedule's are allowed ... We are
> basically in the middle of the scheduler here, so I wonder how much of
> the scheduler itself can be preempted or so ...
>
> Basically, under which circumstances can prev and current be different ?

I remember finding a path through void __sched schedule(void) that
took a branch through the goto need_resched; that would result in this.
I takes a bit of mental gymnastics to see how this might happen.

FWIW, I can send you a debug session showing all cpu's idle and 44 dead
processes sitting in do_exit(). All but two of these were Java threads,
so this seems to be some sort of thread-scheduling subtlty. (the two
that weren't java threads were a find|grep pair that must have gotten
tangled in.)

Given that the patch seems to fix the problem, I didn't dig much deeper.

--linas

2005-04-19 17:39:43

by linas

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Tue, Apr 19, 2005 at 11:01:25AM +1000, Benjamin Herrenschmidt was heard to remark:
> On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
> >
> > Hi,
> >
> > The patch below appears to fix a problem where a number of dead processes
> > linger on the system. On a highly loaded system, dozens of processes
> > were found stuck in do_exit(), calling thier very last schedule(), and
> > then being lost forever.
> >
> > Processes that are PF_DEAD are cleaned up *after* the context switch,
> > in a routine called finish_task_switch(task_t *prev). The "prev" gets
> > the value returned by _switch() in entry.S, but this value comes from
> >
> > __switch_to (struct task_struct *prev,
> > struct task_struct *new)
> > {
> > old_thread = &current->thread; ///XXX shouldn't this be prev, not current?
> > last = _switch(old_thread, new_thread);
> > return last;
> > }
> >
> > The way I see it, "prev" and "current" are almost always going to be
> > pointing at the same thing; however, if a "need resched" happens,
> > or there's a pre-emept or some-such, then prev and current won't be
> > the same; in which case, finish_task_switch() will end up cleaning
> > up the old current, instead of prev. This will result in dead processes
> > hanging around, which will never be scheduled again, and will never
> > get a chance to have put_task_struct() called on them.
>
> I wonder why we bother doing all that at all... we could just return
> "prev" from __switch_to() no ? Like x86 does...

Probably. I assume this funny two-step is left-over from a 2.4 kernel
design point. Naively, we could rturn "prev", this would save a few
cycles. Cut the "addi r3,r3,-THREAD" from entry.S as well. I was being
conservative with the patch, making the smallest change possible.
Do you want this larger patch?


--linas

2005-04-20 05:46:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
>
> Hi,
>
> The patch below appears to fix a problem where a number of dead processes
> linger on the system. On a highly loaded system, dozens of processes
> were found stuck in do_exit(), calling thier very last schedule(), and
> then being lost forever.

Ok, we spent some time with Paul decrypting what _switch_to is supposed
to do. Our understanding at this point is that the current code is
correct on both ppc32 and ppc64, that is:

The "prev" passed in is always "current" and we don't see how it can be
anything else. We use a local variable instead of current in the common
code because accessing current can be slow on some architectures. I
don't see any codepath where prev != current before switch_to.

If we didn't do some black magic that I explain below, _switch_to would
switch the entire context, including stack, and thus including the value
of "prev". Which means that we would always come back with prev beeing
current, which is useless for reaping the old task. What we want is that
this "prev" that was passed to _switch_to() is returned so that we can
rip that previous task despite the change of context, that is basically
prev has to be an invariant vs. the change of context in switch_to.

On ppc & ppc64, we implement that by passing that prev (or it's thread
counterpart) to the assembly context switch code in r3. This code will
preserve it and return it as-is (or re-transformed from thread to task).

So your problem must be somewhere else. I've looked at the need_resched
code path and we always reload prev = current from a non-preemptible
region, so it can't be wrong.

This was verified on 2.6.12-rc2, there might be something else wrong in
an older kernel.

Ben.


2005-05-05 21:57:43

by linas

[permalink] [raw]
Subject: Re: PATCH [PPC64]: dead processes never reaped

On Wed, Apr 20, 2005 at 03:44:10PM +1000, Benjamin Herrenschmidt was heard to remark:
> On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote:
> >
> > The patch below appears to fix a problem where a number of dead processes
> > linger on the system. On a highly loaded system, dozens of processes
> > were found stuck in do_exit(), calling thier very last schedule(), and
> > then being lost forever.

And this problem seems to be unreproducible. Dang, it was one of the
more interesting ones I've seen.

--linas