schedule() checks PF_DEAD on every context switch, and sets ->state = EXIT_DEAD
to ensure that exited task will be deactivated.
I think it is better to set EXIT_DEAD in do_exit(), along with PF_DEAD flag.
It is safe to do without task_rq() locking, because concurrent try_to_wake_up()
can't change task's ->state: the 'state' argument of try_to_wake_up() can't have
EXIT_DEAD bit. And in case when try_to_wake_up() sees stale value of ->state ==
TASK_RUNNING it will do nothing.
Signed-off-by: Oleg Nesterov <[email protected]>
--- 2.6.15-rc2/kernel/exit.c~4_DEAD 2005-11-23 19:33:27.000000000 +0300
+++ 2.6.15-rc2/kernel/exit.c 2005-11-24 19:14:29.000000000 +0300
@@ -875,6 +875,7 @@ fastcall NORET_TYPE void do_exit(long co
preempt_disable();
BUG_ON(tsk->flags & PF_DEAD);
tsk->flags |= PF_DEAD;
+ tsk->state = EXIT_DEAD;
schedule();
BUG();
--- 2.6.15-rc2/kernel/sched.c~4_DEAD 2005-11-22 19:35:52.000000000 +0300
+++ 2.6.15-rc2/kernel/sched.c 2005-11-24 19:13:34.000000000 +0300
@@ -3000,9 +3000,6 @@ need_resched_nonpreemptible:
spin_lock_irq(&rq->lock);
- if (unlikely(prev->flags & PF_DEAD))
- prev->state = EXIT_DEAD;
-
switch_count = &prev->nivcsw;
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
switch_count = &prev->nvcsw;
* Oleg Nesterov <[email protected]> wrote:
> schedule() checks PF_DEAD on every context switch, and sets ->state =
> EXIT_DEAD to ensure that exited task will be deactivated.
>
> I think it is better to set EXIT_DEAD in do_exit(), along with PF_DEAD
> flag.
nice idea - your patch looks good to me.
> It is safe to do without task_rq() locking, because concurrent
> try_to_wake_up() can't change task's ->state: the 'state' argument of
> try_to_wake_up() can't have EXIT_DEAD bit. And in case when
> try_to_wake_up() sees stale value of ->state == TASK_RUNNING it will
> do nothing.
we should really not be getting concurrent wakeups in this situation
anyway - and you are right that even if we got, it should have no effect
neither in the EXIT_DEAD nor in the TASK_RUNNING case.
Acked-by: Ingo Molnar <[email protected]>
Ingo
On Fri, 25 Nov 2005, Ingo Molnar wrote:
> >
> > I think it is better to set EXIT_DEAD in do_exit(), along with PF_DEAD
> > flag.
>
> nice idea - your patch looks good to me.
I'm not entirely convinced.
The thing is, We used to have DEAD in the task state flags, ie TASK_ZOMBIE
was it.
We started using PF_DEAD in 2003 with this commit message:
Author: Linus Torvalds <[email protected]> 2003-10-26 03:16:23
Add a sticky "PF_DEAD" task flag to keep track of dead processes.
Use this to simplify 'finish_task_switch', but perhaps more
importantly we can use this to track down why some processes
seem to sometimes not die properly even after having been
marked as ZOMBIE. The "task->state" flags are too fluid to
allow that well.
ie the PF_DEAD flag was never really about is _needing_ it: it was all
about being able to safely check it _without_ having to rely on
task->state.
So putting it back into task->state is not wrong per se, but it kind of
misses the point of why it was somewhere else in the first place (or
rather, why it was there in the _second_ place, since it was in
task->state in the first place and got moved out of there).
Linus
Linus Torvalds wrote:
>
> I'm not entirely convinced.
>
> The thing is, We used to have DEAD in the task state flags, ie TASK_ZOMBIE
> was it.
>
> We started using PF_DEAD in 2003 with this commit message:
>
> Author: Linus Torvalds <[email protected]> 2003-10-26 03:16:23
>
> Add a sticky "PF_DEAD" task flag to keep track of dead processes.
>
> Use this to simplify 'finish_task_switch', but perhaps more
> importantly we can use this to track down why some processes
> seem to sometimes not die properly even after having been
> marked as ZOMBIE. The "task->state" flags are too fluid to
> allow that well.
>
> ie the PF_DEAD flag was never really about is _needing_ it: it was all
> about being able to safely check it _without_ having to rely on
> task->state.
>
> So putting it back into task->state is not wrong per se, but it kind of
> misses the point of why it was somewhere else in the first place (or
> rather, why it was there in the _second_ place, since it was in
> task->state in the first place and got moved out of there).
schedule:
if (unlikely(prev->flags & PF_DEAD))
prev->state = EXIT_DEAD;
Which means: "If PF_DEAD is set, ignore ->state value. It should be TASK_RUNNING,
but we have to change it, otherwise the task won't be deactivated. We are using
EXIT_DEAD (which should live only in ->exit_state) because other TASK_XXX values
won't work".
So in my opinion PF_DEAD has already slipped into the ->state partly. And I still
think that at least the first patch is ok, it does not change the things, but may
be considered as microoptimization.
On the other hand this microoptimization is negligible, all cleanups is a matter
of taste always, so let's forget about it if you don't change you mind.
Oleg.
On Sat, 26 Nov 2005, Oleg Nesterov wrote:
> >
> > So putting it back into task->state is not wrong per se, but it kind of
> > misses the point of why it was somewhere else in the first place (or
> > rather, why it was there in the _second_ place, since it was in
> > task->state in the first place and got moved out of there).
>
> schedule:
>
> if (unlikely(prev->flags & PF_DEAD))
> prev->state = EXIT_DEAD;
>
> Which means: "If PF_DEAD is set, ignore ->state value. It should be TASK_RUNNING,
> but we have to change it, otherwise the task won't be deactivated. We are using
> EXIT_DEAD (which should live only in ->exit_state) because other TASK_XXX values
> won't work".
>
> So in my opinion PF_DEAD has already slipped into the ->state partly.
You mis-understand.
PF_DEAD has _always_ been about the task state, in a very serious way. It
didn't "slip into" it. It always was very much about it.
The problem is that we touch "task->state" in a _lot_ of places: for
example, when we take a page fault, we have to clear it, because we can't
just run with some random task state (see top of __handle_mm_fault).
PF_DEAD was a "safe haven". It's somewhere that we _don't_ modify the word
in many places, so it doesn't get lost, and we can do sanity checking (ie
we can have things like "BUG_ON(tsk->flags & PF_DEAD)" to make sure that
the task really is valid in a few places.
Now, arguably the task struct handling is solid enough that maybe we don't
need this any more. But this is what it was all about: it was hidden away
in a non-obvious place exactly _because_ we wanted it hidden away
somewhere where the normal ops wouldn't ever touch it.
Linus
Linus Torvalds wrote:
>
> > So in my opinion PF_DEAD has already slipped into the ->state partly.
>
> You mis-understand.
Yes.
Ok, I see you point now, thanks.
Oleg.
> PF_DEAD has _always_ been about the task state, in a very serious way. It
> didn't "slip into" it. It always was very much about it.
>
> The problem is that we touch "task->state" in a _lot_ of places: for
> example, when we take a page fault, we have to clear it, because we can't
> just run with some random task state (see top of __handle_mm_fault).
>
> PF_DEAD was a "safe haven". It's somewhere that we _don't_ modify the word
> in many places, so it doesn't get lost, and we can do sanity checking (ie
> we can have things like "BUG_ON(tsk->flags & PF_DEAD)" to make sure that
> the task really is valid in a few places.
>
> Now, arguably the task struct handling is solid enough that maybe we don't
> need this any more. But this is what it was all about: it was hidden away
> in a non-obvious place exactly _because_ we wanted it hidden away
> somewhere where the normal ops wouldn't ever touch it.
>
> Linus
On Sat, 26 Nov 2005, Oleg Nesterov wrote:
>
> Ok, I see you point now, thanks.
Well, in all fairness, it's entirely possible (nay, likely) that PF_DEAD
just isn't relevant the way it used to be.
Looking at the history, the immediate reason for PF_DEAD was (I think)
that we had a race where we would first mark the task as a ZOMBIE, and
then if it was self-reaping, we'd mark it dead in release_task(). But we
dropped the tasklist_lock in between, which meant that the real parent
could come in and reap it just before it reaped itself, and all hell would
break lose.
That bug was fixed sixteen hours after PF_DEAD was introduced (thanks to a
bug-on on PF_DEAD that never even made the kernel history, I suspect), and
I doubt PF_DEAD has done anything for us since (and this was two years
ago).
What I'm trying to say is that it's entirely possible that your cleanup is
just way overdue, and we shouldn't have that separate PF_DEAD thing any
more. My arguments against your patch is mainly me just being nervous, and
I'm not 100% convinced they are valid and good arguments.
HOWEVER. I just noticed something strange. EXIT_DEAD should be in
"task->exit_state", not in "task->state". So there's something strange
going on in that neck of the woods _anyway_. That whole
...
if (unlikely(prev->flags & PF_DEAD))
prev->state = EXIT_DEAD;
...
in kernel/sched.c seems totally bogus.
I think we should remove that thing entirely, since it's actually a total
no-op, apart from doing something that is actively wrong. The "exit_state"
flag should already _be_ EXIT_DEAD, no? And the "task->state" flag should
never be EXIT_DEAD in the first place.
And now that "exit_state" is already separate from "state", the reason for
PF_DEAD really is gone, and it could be replaced with a
tsk->exit_state & EXIT_DEAD
test instead.
Roland added to Cc:, because the exit_state splitup was his, I think.
Oleg? Roland?
Linus
Linus Torvalds wrote:
>
> HOWEVER. I just noticed something strange. EXIT_DEAD should be in
> "task->exit_state", not in "task->state". So there's something strange
> going on in that neck of the woods _anyway_. That whole
>
> ...
> if (unlikely(prev->flags & PF_DEAD))
> prev->state = EXIT_DEAD;
> ...
>
> in kernel/sched.c seems totally bogus.
We can use any random value instead of EXIT_DEAD (except RUNNING,INTERRUPTIBLE,
and UNINTERRUPTIBLE). The only reason for changing the state's value is this
check below:
if (->state && !PREEMPT_ACTIVE) {
...
deactivate_task();
}
This state is invisible to proc/array.c because yes, we already have something
in ->exit_state.
We can add new TASK_DEAD or something for that. Or we can do:
- if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
+ // PF_DEAD means that preemption was disabled
+ if (PF_DEAD || (prev->state && !(preempt_count() & PREEMPT_ACTIVE))) {
but this is so ugly.
> I think we should remove that thing entirely, since it's actually a total
> no-op, apart from doing something that is actively wrong. The "exit_state"
> flag should already _be_ EXIT_DEAD, no?
No, it could be EXIT_ZOMBIE if the task was not auto-reaped. So the task
should be deactivated when ->exit_state != 0. But see below.
> And now that "exit_state" is already separate from "state", the reason for
> PF_DEAD really is gone, and it could be replaced with a
>
> tsk->exit_state & EXIT_DEAD
>
> test instead.
No, I don't think this is right. After exit_notify() sets ->exit_state this
process is still valid from the scheduler POV, it may be preempted, may sleep.
So I beleive we really need to do something special when exiting process does
it's last context switch from do_exit(). If the process enters schedule() in
TASK_RUNNUNG, we should pass the hint to scheduler - PF_DEAD. Or we can change
the ->state before calling schedule().
Oleg.
* Linus Torvalds <[email protected]> wrote:
> > > I think it is better to set EXIT_DEAD in do_exit(), along with PF_DEAD
> > > flag.
> >
> > nice idea - your patch looks good to me.
>
> I'm not entirely convinced.
>
> The thing is, We used to have DEAD in the task state flags, ie
> TASK_ZOMBIE was it.
>
> We started using PF_DEAD in 2003 with this commit message:
>
> Author: Linus Torvalds <[email protected]> 2003-10-26 03:16:23
>
> Add a sticky "PF_DEAD" task flag to keep track of dead processes.
>
> Use this to simplify 'finish_task_switch', but perhaps more
> importantly we can use this to track down why some processes
> seem to sometimes not die properly even after having been
> marked as ZOMBIE. The "task->state" flags are too fluid to
> allow that well.
>
> ie the PF_DEAD flag was never really about is _needing_ it: it was all
> about being able to safely check it _without_ having to rely on
> task->state.
but this was in times when we did alot of nontrivial operations after we
marked a task "dead". Today we do this:
/* PF_DEAD causes final put_task_struct after we schedule. */
preempt_disable();
BUG_ON(tsk->flags & PF_DEAD);
tsk->flags |= PF_DEAD;
schedule();
BUG();
(i introduced the above changes to make more of the exit path
preemptable.)
PF_DEAD has zero relevance by today, and Oleg's patches are perfectly
correct and dont add the kind of risk that they'd have meant in 2003.
> So putting it back into task->state is not wrong per se, but it kind
> of misses the point of why it was somewhere else in the first place
> (or rather, why it was there in the _second_ place, since it was in
> task->state in the first place and got moved out of there).
yeah, PF_DEAD had its purpose back when we first marked a task dead,
then we did the release_task(). We used to have problems with
proc_pid_flush() which could sleep, which would lose the TASK_ZOMBIE or
TASK_DEAD flag and we'd come back from the 'final' schedule().
today that's impossible, due to the code above - we only mark it PF_DEAD
straight before going into the final schedule().
Ingo
* Linus Torvalds <[email protected]> wrote:
> Well, in all fairness, it's entirely possible (nay, likely) that
> PF_DEAD just isn't relevant the way it used to be.
yes, that's the case. I do believe we should go with Oleg's original
patches.
> Looking at the history, the immediate reason for PF_DEAD was (I think)
> that we had a race where we would first mark the task as a ZOMBIE, and
> then if it was self-reaping, we'd mark it dead in release_task(). But
> we dropped the tasklist_lock in between, which meant that the real
> parent could come in and reap it just before it reaped itself, and all
> hell would break lose.
yeah. Another problem was if we accidentally scheduled somewhere after
the task was marked TASK_DEAD. (That used to be caught by another
PF_DEAD check before the final schedule()).
all this stuff got changed as part of the PREEMPT_VOLUNTARY changes, and
now we exit in a much more robust way.
> HOWEVER. I just noticed something strange. EXIT_DEAD should be in
> "task->exit_state", not in "task->state". So there's something strange
> going on in that neck of the woods _anyway_. That whole
>
> ...
> if (unlikely(prev->flags & PF_DEAD))
> prev->state = EXIT_DEAD;
> ...
>
> in kernel/sched.c seems totally bogus.
no, it's not bogus. PF_DEAD is basically just a fancy, persistent flag
for "mark the task state as EXIT_DEAD atomically before the final
schedule". The TASK_DEAD flag is still necessary so that we dont stay on
the runqueue. [we could use TASK_UNINTERRUPTIBLE - it should just not be
TASK_RUNNING or TASK_INTERRUPTIBLE.]
in any case, i agree with Oleg's patches: with the current code it's
unnecessary to signal towards schedule() to turn the PF_DEAD flag into
EXIT_DEAD - we can do it right where we set PF_DEAD: immediately before
calling the final schedule(). PF_DEAD is a relic from the days when the
exit path was structured differently.
Ingo