LinuxLists.cc - Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 13:28, Dave Hansen wrote:

> I've replicated the problem too.
> I've diabled preemption in the area where it used to be disabled because
> of the old lock_kernel(). I'm sending this message from a machine with
> that patch applied, so the patch does fix it. As the comment says, this
> is something that the preempt experts need to take a look at.
> Linus, this is a hack, and there is probably still a window where
> preemption can happen. But, it is a band-aid until we find the real
> problem.

Thanks for the CC. I've been looking into this problem. I am not too
sure why we require protection from concurrency via preemption and not
via SMP. In other words, why are we SMP-safe but not preempt-safe here.

I don't really have an answer.

The problem I saw on boot was the BUG being triggered on the last line.
If we are able to preempt here and cause some problem, what proof is
there that we don't have a race wrt SMP?

Robert Love

2002-04-04 19:14:35

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 4 Apr 2002, Dave Hansen wrote:
>
> I've diabled preemption in the area where it used to be disabled because
> of the old lock_kernel(). I'm sending this message from a machine with
> that patch applied, so the patch does fix it.

I think the real fix is to make sure that preemption never hits while
current->state == TASK_ZOMBIE.

In other words, I think the _correct_ fix is to just make
preempt_schedule() check for something like

if (current->state != TASK_RUNNING)
return;

and just getting rid of the current "unconditionally set state to
running".

(Side note: if the state isn't running, we're most likely going to
reschedule in a controlled manner soon anyway. Sure, there are some loops
which set state to non-runnable early for race avoidance, but is it a
_problem_?)

Linus

2002-04-04 19:15:28

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On 4 Apr 2002, Robert Love wrote:
>
> Thanks for the CC. I've been looking into this problem. I am not too
> sure why we require protection from concurrency via preemption and not
> via SMP. In other words, why are we SMP-safe but not preempt-safe here.
>
> I don't really have an answer.

The answer is that preempt_schedule() illegally sets

current->state = TASK_RUNNING;

without asking the process whether that's ok. The SMP code never does
anything like that.

Linus

2002-04-04 19:17:35

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 14:13, Linus Torvalds wrote:

> I think the real fix is to make sure that preemption never hits while
> current->state == TASK_ZOMBIE.
>
> In other words, I think the _correct_ fix is to just make
> preempt_schedule() check for something like
>
> if (current->state != TASK_RUNNING)
> return;
>
> and just getting rid of the current "unconditionally set state to
> running".
>
> (Side note: if the state isn't running, we're most likely going to
> reschedule in a controlled manner soon anyway. Sure, there are some loops
> which set state to non-runnable early for race avoidance, but is it a
> _problem_?)

Eh, maybe - what about all the code that sets non-running before putting
itself on a wait queue?

Robert Love

2002-04-04 19:28:07

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 14:14, Linus Torvalds wrote:

> The answer is that preempt_schedule() illegally sets
>
> current->state = TASK_RUNNING;
>
> without asking the process whether that's ok. The SMP code never does
> anything like that.

Well Ingo added that ;)

We used to just set a flag in the preempt_count that marked the task as
preempted and made sure on its next trip into schedule it ran again.

Do you think it is better to deny preemption if state==TASK_ZOMBIE (note
this requires code in preempt_schedule and the interrupt return path,
since Ingo decoupled the two) or just disable preemption around critical
regions caused by setting state to TASK_ZOMBIE ?

I suspect this is the first occurrence of a problem of this kind ... and
the attached patch handles it.

Robert Love

diff -urN linux-2.5.8-pre1/kernel/exit.c linux/kernel/exit.c
--- linux-2.5.8-pre1/kernel/exit.c Wed Apr 3 20:57:37 2002
+++ linux/kernel/exit.c Thu Apr 4 14:22:30 2002
@@ -499,6 +499,8 @@
acct_process(code);
__exit_mm(tsk);

+ preempt_disable();
+
sem_exit();
__exit_files(tsk);
__exit_fs(tsk);
@@ -515,6 +517,7 @@

tsk->exit_code = code;
exit_notify();
+ preempt_enable_no_resched();
schedule();
BUG();
/*

2002-04-04 19:42:15

by Dave Hansen

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

Robert Love wrote:
> tsk->exit_code = code;
> exit_notify();
> + preempt_enable_no_resched();
* PREEMPT HERE *
> schedule();
> BUG();

Isn't there still a race here? A preempt CAN happen after you reenable,
right? Admittedly, its a small window, but it _is_ a window.

--
Dave Hansen
[email protected]

2002-04-04 19:45:55

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On 4 Apr 2002, Robert Love wrote:
>
> Eh, maybe - what about all the code that sets non-running before putting
> itself on a wait queue?

In most cases that code will call a schedule itself.

Of course, we might make just ZOMBIE a special case, but in general I
think it's simply absolutely wrong for the preemption to change task
internal data structures on its own.

Linus

2002-04-04 19:51:05

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Linus Torvalds wrote:
>
> On Thu, 4 Apr 2002, Dave Hansen wrote:
> >
> > I've diabled preemption in the area where it used to be disabled because
> > of the old lock_kernel(). I'm sending this message from a machine with
> > that patch applied, so the patch does fix it.
>
> I think the real fix is to make sure that preemption never hits while
> current->state == TASK_ZOMBIE.
>
> In other words, I think the _correct_ fix is to just make
> preempt_schedule() check for something like
>
> if (current->state != TASK_RUNNING)
> return;
>
I agree that the set state to running code should go, but there are
places in the kernel where a great deal of time is spent with the state
other than running. AND most of them make sense. Adding a PREEMPTING
super state, as was in the original patch, addressed this quite well
IMHO. This super state was treated by the scheduler as RUNNING
regardless of the actual state. As long as it is not in the actual
"state" member, the condition waited for can still set the true state to
RUNNING and all is well.

What is the objection to this?

> and just getting rid of the current "unconditionally set state to
> running".
>
> (Side note: if the state isn't running, we're most likely going to
> reschedule in a controlled manner soon anyway. Sure, there are some loops
> which set state to non-runnable early for race avoidance, but is it a
> _problem_?)
>
All I can say is that we thought it was when we did the original
preemption work. Some of those loops are long.

I tend to agree with not doing the random "set state to running". It
breeds paranoia, and rightly so.

And yes the ZOMBIE code must not be preempted. This is one of the
reasons the preempt code was designed to allow schedule() to be called
with preemption disabled. I gives a nice clean schedule() call. It
might be a better solution than the interrupt off schedule() calls made
in the sleep calls in sched.c.

> Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2002-04-04 20:04:36

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Dave Hansen wrote:
>
> Robert Love wrote:
> > tsk->exit_code = code;
> > exit_notify();
> > + preempt_enable_no_resched();
> * PREEMPT HERE *
> > schedule();
> > BUG();
>
> Isn't there still a race here? A preempt CAN happen after you reenable,
> right? Admittedly, its a small window, but it _is_ a window.

Right, eliminate this line. Since the task is going away it doesn't
matter. Actually it doesn't matter even if schedule() returns (i.e. for
other than ZOMBIE) as long as preempt is enabled after the schedule()
call, somewhere. It is designed to work this way for a reason, this
being one of them.

-g
>
> --
> Dave Hansen
> [email protected]
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2002-04-04 20:13:40

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Linus Torvalds wrote:
>
> On 4 Apr 2002, Robert Love wrote:
> >
> > Eh, maybe - what about all the code that sets non-running before putting
> > itself on a wait queue?
>
> In most cases that code will call a schedule itself.
>
> Of course, we might make just ZOMBIE a special case, but in general I
> think it's simply absolutely wrong for the preemption to change task
> internal data structures on its own.
>
Amen!

> Linus
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2002-04-04 20:55:23

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Robert Love wrote:
>
> ...
> Do you think it is better to deny preemption if state==TASK_ZOMBIE (note
> this requires code in preempt_schedule and the interrupt return path,
> since Ingo decoupled the two) or just disable preemption around critical
> regions caused by setting state to TASK_ZOMBIE ?
>
> I suspect this is the first occurrence of a problem of this kind ... and
> the attached patch handles it.
>

No, the problem goes deeper than this.

I have code which does, effectively:

sleeper()
{
spin_lock(&some_lock);
set_current_state(TASK_UNINTERRUPTIBLE);
some_flag = 0;
spin_unlock(&lock);
schedule();
if (some_flag == 0)
i_am_horribly_confused();
}

waker()
{
spin_lock(&some_lock);
some_flag = 1;
wake_up_process(sleeper);
spin_unlock(&some_lock);
}

or something like that. See __pdflush() and
pdflush_operation() in http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8-pre1/delalloc/dallocbase-60-pdflush.patch

The above code work fine, is nice and I want to keep
it that way. But it fails on preempt.

The spin_unlock() in sleeper() can sometimes set
task->state to TASK_RUNNING(), so my schedule() call
just falls straight through.

Probably nobody has noticed this in other places because
most sleep/wakeup stuff tends to be done inside a loop;
the bogus "wakeup" is ignored.

Although it can be worked around at the call site, I
think this needs fixing. Otherwise we have the rule
"spin_unlock will flip you into TASK_RUNNING 0.0001%
of the time if CONFIG_PREEMPT=y". ug.

I have thought deeply about this, and I then promptly
forgot everything I thought about, but I ended up
concluding that the sanest way of resolving this is
inside __set_current_state(). If the new state is
TASK_RUNNING and the old state is not TASK_RUNNING
then enable preemption, call schedule() if necessary, etc.

It is not acceptable to just say "don't preempt a task
which is not in state TASK_RUNNING", because if an
interrupt happens against a CPU which is running a task
which is in state TASK_INTERRUPTIBLE (say), then that
wakeup won't be serviced until the task exits the kernel.

-

2002-04-04 21:33:43

by Roger Larsson

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On torsdagen den 4 april 2002 21.26, Robert Love wrote:
> On Thu, 2002-04-04 at 14:14, Linus Torvalds wrote:
> > The answer is that preempt_schedule() illegally sets
> >
> > current->state = TASK_RUNNING;
> >
> > without asking the process whether that's OK. The SMP code never does
> > anything like that.
>
> Well Ingo added that ;)
>
> We used to just set a flag in the preempt_count that marked the task as
> preempted and made sure on its next trip into schedule it ran again.
>

How about doing:

asmlinkage void preempt_schedule(void)
{
unsigned long saved_state;

if (unlikely(preempt_get_count()))
return;

preempt_disable(); /* or use an atomic operation */
saved_state = current->state;
current->state = TASK_RUNNING;
preempt_enable_no_resched(); /* we are scheduling anyway... */
schedule();
current->state = saved_state;
}

It is unlikely to get preemption between schedule() and the
setting since schedule it self checks - the window is small.
And when it hits if will correctly restore the correct value.

Note this code does not need to solve the FLAG problem.

current->state |= FLAG
* PREEMPT *
current->state |= FLAG
schedule()
current->state &= ~FLAG
schedule() with flag disabled

/RogerL

--
Roger Larsson
Skellefte?
Sweden

2002-04-04 22:40:44

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

Roger Larsson wrote:
>
> ...
> How about doing:
>
> asmlinkage void preempt_schedule(void)
> {
> unsigned long saved_state;
>
> if (unlikely(preempt_get_count()))
> return;
>
> preempt_disable(); /* or use an atomic operation */
> saved_state = current->state;
> current->state = TASK_RUNNING;
> preempt_enable_no_resched(); /* we are scheduling anyway... */
> schedule();

Interrupt occurs, puts this task in state TASK_RUNNING.

> current->state = saved_state;

whoops. We went back into TASK_UNINTERRUPTIBLE.

> }
>

We could fix this with changes to schedule(), but that's
not nice. Another approach would be:

preempt_schedule()
{
current->state2 = current->state;
current->state = TASK_RUNNING;
schedule();
current->state = current->state2;
}

and wake_up() would do:

tsk->state = TASK_RUNNING;
tsk->state2 = TASK_RUNNING;

With the appropriate locking, memory barriers and other
relevant goo I think this would work...

-

2002-04-04 22:43:54

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 4 Apr 2002, Andrew Morton wrote:
> Another approach would be:
>
> preempt_schedule()
> {
> current->state2 = current->state;
> current->state = TASK_RUNNING;
> schedule();
> current->state = current->state2;
> }

Yes, but please no.

My current tree says

asmlinkage void preempt_schedule(void)
{
if (unlikely(preempt_get_count()))
return;
if (current->state != TASK_RUNNING)
return;
schedule();
}

and if people start getting latency problems due to loops with state !=
TASK_RUNNING, then I suspect we might just make "set_current_state()"
check that case explicitly and do a conditional reschedule (ie make it the
same as if we released a lock). That would be a hell of a lot cleaner, in
my opinion.

Linus

2002-04-04 22:56:25

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 17:38, Andrew Morton wrote:

> With the appropriate locking, memory barriers and other
> relevant goo I think this would work...

Yah, I guess, but that isn't pretty at all ;)

Andrew, remember how we used to do it (and still do it in the 2.4
patch)? Wouldn't that work? Specifically, when we enter
preempt_schedule we set a flag value in preempt_count. This flag value
is checked at the top of schedule and, if set, we skip the first chunk
of code that handles sleeping tasks. The task->state never changes.
Upon leaving schedule and returning to preempt_schedule, we unset the
flag.

This allows us to preempt tasks in any state, without problems or
special cases. It also wasn't too much overhead - compared to now,
basically just:

if (unlikely(current->preempt_count() & PREEMPT_ACTIVE))
goto pick_next_task;

at the top of schedule().

Robert Love

2002-04-04 22:57:05

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Linus Torvalds wrote:
>
> On Thu, 4 Apr 2002, Andrew Morton wrote:
> > Another approach would be:
> >
> > preempt_schedule()
> > {
> > current->state2 = current->state;
> > current->state = TASK_RUNNING;
> > schedule();
> > current->state = current->state2;
> > }
>
> Yes, but please no.
>
> My current tree says
>
> asmlinkage void preempt_schedule(void)
> {
> if (unlikely(preempt_get_count()))
> return;
> if (current->state != TASK_RUNNING)
> return;
> schedule();
> }
>
> and if people start getting latency problems due to loops with state !=
> TASK_RUNNING, then I suspect we might just make "set_current_state()"
> check that case explicitly and do a conditional reschedule (ie make it the
> same as if we released a lock). That would be a hell of a lot cleaner, in
> my opinion.
>

That would work. And would also fix my "spin_unlock sometimes
stomps on TASK_INTERRUPTIBLE" problem.

It does mean that we'll need to convert many open-coded

current->state = whatever;

instances to use [__]set_current_state(). But that's not
a bad thing. Janitorial patches for this are already
floating about.

Robert, do you have time to do the code-and-test thing?

-

2002-04-04 23:12:57

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

Robert Love wrote:
>
> ...
> This allows us to preempt tasks in any state, without problems or
> special cases. It also wasn't too much overhead - compared to now,
> basically just:
>
> if (unlikely(current->preempt_count() & PREEMPT_ACTIVE))
> goto pick_next_task;
>
> at the top of schedule().
>

That's changing schedule(). Seems that I'd ruled out that
option prematurely. As current->preempt_count() and PREEMPT_ACTIVE
can both evaluate to constant zero if CONFIG_PREEMPT=n, it can
be done ifdeflessly.

Everything happens inside rq->lock. Looks solid to me.

-

2002-04-04 23:09:47

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 17:42, Linus Torvalds wrote:

> My current tree says
>
> asmlinkage void preempt_schedule(void)
> {
> if (unlikely(preempt_get_count()))
> return;
> if (current->state != TASK_RUNNING)
> return;
> schedule();
> }

You need to do the same thing in entry.S ... Ingo decoupled the two
entry paths from each other (entry.S does preempt_schedule's work on its
own, and then directly calls schedule).

So, right now the ret_from_intr path sets the task state to
TASK_RUNNING. That needs to be replaced with a jump out if it is NOT
task running.

Curiously, what is wrong with the way we did it in the original patch?
I.e. set a bit in preempt_count and use that to skip to pick_next_task
in schedule. This allows us to preempt any task of any state ...

Robert Love

2002-04-04 23:16:27

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() atboot time

On Thu, 2002-04-04 at 18:10, Andrew Morton wrote:

> That's changing schedule(). Seems that I'd ruled out that
> option prematurely. As current->preempt_count() and PREEMPT_ACTIVE
> can both evaluate to constant zero if CONFIG_PREEMPT=n, it can
> be done ifdeflessly.
>
> Everything happens inside rq->lock. Looks solid to me.

It is solid - I did not just invent that approach, it was how we have
always done it until around 2.5.6-pre (Ingo sent a patch to change it).
The 2.4 patches do it this way (take a look) and 2.5 before the change
obviously worked like this.

It hasn't shown any problems in ~6 months of use.

I'll cook up a patch to do it, but I'd like to hear Linus opinion ...

Robert Love

2002-04-04 23:43:41

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On 4 Apr 2002, Robert Love wrote:
>
> Curiously, what is wrong with the way we did it in the original patch?
> I.e. set a bit in preempt_count and use that to skip to pick_next_task
> in schedule. This allows us to preempt any task of any state ...

Because that requires that every user of "set_task_state()" needs to know
about preemption.

Btw, I think entry.S should just call preempt_schedule() instead, instead
of knowing about these details.

Linus

2002-04-04 23:48:21

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On Thu, 2002-04-04 at 18:42, Linus Torvalds wrote:

> Because that requires that every user of "set_task_state()" needs to know
> about preemption.

Hm, how so? I contend not to rudely set the task state but instead mark
the task as "preempted" in preempt_schedule and handle this case in
schedule.

It requires zero change to anything else; this is the behavior of the
original patch I sent you.

> Btw, I think entry.S should just call preempt_schedule() instead, instead
> of knowing about these details.

Agreed.

Robert Love

2002-04-04 23:56:51

[permalink] [raw]

Subject: Re: Patch: linux-2.5.8-pre1/kernel/exit.c change caused BUG() at boot time

On 4 Apr 2002, Robert Love wrote:
>
> Hm, how so? I contend not to rudely set the task state but instead mark
> the task as "preempted" in preempt_schedule and handle this case in
> schedule.

Ahh, I misunderstood - I thought you meant setting the bit when setting
current->state.

Fair enough. Send me a patch to look at.

Linus

2002-04-05 00:04:12

[permalink] [raw]

Subject: [PATCH] preemptive kernel behavior change: don't be rude

On Thu, 2002-04-04 at 18:55, Linus Torvalds wrote:
> Fair enough. Send me a patch to look at.

Linus et all,

Here we go:

- do not manually set task->state
- instead, in preempt_schedule, set a flag in preempt_count that
denotes that this task is entering schedule off a kernel preemption.
- use this flag in schedule to jump to pick_next_task
- in preempt_schedule, upon return from schedule, unset the flag
- have entry.S just call preempt_schedule and not duplicate this work,
as Linus suggested. I agree. Note this makes debugging easier as
we keep a single point of entry for kernel preemptions.

The result: we can safely preempt non-TASK_RUNNING tasks. If one is
preempted, we can safely survive schedule because we won't handle the
special casing of non-TASK_RUNNING at the top of schedule. Thus other
tasks can run as desired and our non-TASK_RUNNING task will eventually
be rescheduled, in its original state, and complete happily.

This is the behavior we have in the 2.4 patches and 2.5 until
~2.5.6-pre. This works. It requires no other changes elsewhere (it
actually removes some special-casing Ingo did in the signal code).

This patch works in theory and compiles, but received minimal testing.

Robert Love

diff -urN linux-2.5.8-pre1/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.8-pre1/arch/i386/kernel/entry.S Wed Apr 3 20:57:24 2002
+++ linux/arch/i386/kernel/entry.S Thu Apr 4 18:32:03 2002
@@ -240,9 +240,7 @@
jnz restore_all
incl TI_PRE_COUNT(%ebx)
sti
- movl TI_TASK(%ebx), %ecx # ti->task
- movl $0,(%ecx) # current->state = TASK_RUNNING
- call SYMBOL_NAME(schedule)
+ call SYMBOL_NAME(preempt_schedule)
jmp ret_from_intr
#endif

diff -urN linux-2.5.8-pre1/arch/i386/kernel/ptrace.c linux/arch/i386/kernel/ptrace.c
--- linux-2.5.8-pre1/arch/i386/kernel/ptrace.c Wed Apr 3 20:57:24 2002
+++ linux/arch/i386/kernel/ptrace.c Thu Apr 4 18:31:17 2002
@@ -455,11 +455,9 @@
between a syscall stop and SIGTRAP delivery */
current->exit_code = SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD)
? 0x80 : 0);
- preempt_disable();
current->state = TASK_STOPPED;
notify_parent(current, SIGCHLD);
schedule();
- preempt_enable();
/*
* this isn't the same as continuing with a signal, but it will do
* for normal use. strace only continues with a signal if the
diff -urN linux-2.5.8-pre1/arch/i386/kernel/signal.c linux/arch/i386/kernel/signal.c
--- linux-2.5.8-pre1/arch/i386/kernel/signal.c Wed Apr 3 20:57:24 2002
+++ linux/arch/i386/kernel/signal.c Thu Apr 4 18:31:00 2002
@@ -610,11 +610,9 @@
if ((current->ptrace & PT_PTRACED) && signr != SIGKILL) {
/* Let the debugger run. */
current->exit_code = signr;
- preempt_disable();
current->state = TASK_STOPPED;
notify_parent(current, SIGCHLD);
schedule();
- preempt_enable();

/* We're back. Did the debugger cancel the sig? */
if (!(signr = current->exit_code))
@@ -669,14 +667,12 @@

case SIGSTOP: {
struct signal_struct *sig;
+ current->state = TASK_STOPPED;
current->exit_code = signr;
sig = current->parent->sig;
- preempt_disable();
- current->state = TASK_STOPPED;
if (sig && !(sig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDSTOP))
notify_parent(current, SIGCHLD);
schedule();
- preempt_enable();
continue;
}

diff -urN linux-2.5.8-pre1/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.8-pre1/include/linux/sched.h Wed Apr 3 20:57:27 2002
+++ linux/include/linux/sched.h Thu Apr 4 18:29:53 2002
@@ -91,6 +91,7 @@
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
+#define PREEMPT_ACTIVE 0x4000000

#define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)
diff -urN linux-2.5.8-pre1/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.8-pre1/kernel/sched.c Wed Apr 3 20:57:37 2002
+++ linux/kernel/sched.c Thu Apr 4 18:29:24 2002
@@ -764,6 +764,13 @@
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);

+ /*
+ * if entering from preempt_schedule, off a kernel preemption,
+ * go straight to picking the next task.
+ */
+ if (unlikely(preempt_get_count() & PREEMPT_ACTIVE))
+ goto pick_next_task;
+
switch (prev->state) {
case TASK_INTERRUPTIBLE:
if (unlikely(signal_pending(prev))) {
@@ -775,7 +782,7 @@
case TASK_RUNNING:
;
}
-#if CONFIG_SMP
+#if CONFIG_SMP || CONFIG_PREEMPT
pick_next_task:
#endif
if (unlikely(!rq->nr_running)) {
@@ -843,8 +850,11 @@
{
if (unlikely(preempt_get_count()))
return;
- current->state = TASK_RUNNING;
+
+ current_thread_info()->preempt_count += PREEMPT_ACTIVE;
schedule();
+ current_thread_info()->preempt_count -= PREEMPT_ACTIVE;
+ barrier();
}
#endif /* CONFIG_PREEMPT */

2002-04-05 01:55:39

[permalink] [raw]

Subject: Re: [PATCH] preemptive kernel behavior change: don't be rude

Robert Love wrote:
>
> On Thu, 2002-04-04 at 18:55, Linus Torvalds wrote:
> > Fair enough. Send me a patch to look at.
>
> Linus et all,
>
> Here we go:
>
> - do not manually set task->state
> - instead, in preempt_schedule, set a flag in preempt_count that
> denotes that this task is entering schedule off a kernel preemption.
> - use this flag in schedule to jump to pick_next_task
> - in preempt_schedule, upon return from schedule, unset the flag
> - have entry.S just call preempt_schedule and not duplicate this work,
> as Linus suggested. I agree. Note this makes debugging easier as
> we keep a single point of entry for kernel preemptions.
>
> The result: we can safely preempt non-TASK_RUNNING tasks. If one is
> preempted, we can safely survive schedule because we won't handle the
> special casing of non-TASK_RUNNING at the top of schedule. Thus other
> tasks can run as desired and our non-TASK_RUNNING task will eventually
> be rescheduled, in its original state, and complete happily.
>
> This is the behavior we have in the 2.4 patches and 2.5 until
> ~2.5.6-pre. This works. It requires no other changes elsewhere (it
> actually removes some special-casing Ingo did in the signal code).
>
> This patch works in theory and compiles, but received minimal testing.
>
> Robert Love
>
> diff -urN linux-2.5.8-pre1/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
> --- linux-2.5.8-pre1/arch/i386/kernel/entry.S Wed Apr 3 20:57:24 2002
> +++ linux/arch/i386/kernel/entry.S Thu Apr 4 18:32:03 2002
> @@ -240,9 +240,7 @@
> jnz restore_all
> incl TI_PRE_COUNT(%ebx)
> sti
> - movl TI_TASK(%ebx), %ecx # ti->task
> - movl $0,(%ecx) # current->state = TASK_RUNNING
> - call SYMBOL_NAME(schedule)
> + call SYMBOL_NAME(preempt_schedule)
> jmp ret_from_intr
> #endif
>
> diff -urN linux-2.5.8-pre1/arch/i386/kernel/ptrace.c linux/arch/i386/kernel/ptrace.c
> --- linux-2.5.8-pre1/arch/i386/kernel/ptrace.c Wed Apr 3 20:57:24 2002
> +++ linux/arch/i386/kernel/ptrace.c Thu Apr 4 18:31:17 2002
> @@ -455,11 +455,9 @@
> between a syscall stop and SIGTRAP delivery */
> current->exit_code = SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD)
> ? 0x80 : 0);
> - preempt_disable();
> current->state = TASK_STOPPED;
> notify_parent(current, SIGCHLD);
> schedule();
> - preempt_enable();
> /*
> * this isn't the same as continuing with a signal, but it will do
> * for normal use. strace only continues with a signal if the
> diff -urN linux-2.5.8-pre1/arch/i386/kernel/signal.c linux/arch/i386/kernel/signal.c
> --- linux-2.5.8-pre1/arch/i386/kernel/signal.c Wed Apr 3 20:57:24 2002
> +++ linux/arch/i386/kernel/signal.c Thu Apr 4 18:31:00 2002
> @@ -610,11 +610,9 @@
> if ((current->ptrace & PT_PTRACED) && signr != SIGKILL) {
> /* Let the debugger run. */
> current->exit_code = signr;
> - preempt_disable();
> current->state = TASK_STOPPED;
> notify_parent(current, SIGCHLD);
> schedule();
> - preempt_enable();
>
> /* We're back. Did the debugger cancel the sig? */
> if (!(signr = current->exit_code))
> @@ -669,14 +667,12 @@
>
> case SIGSTOP: {
> struct signal_struct *sig;
> + current->state = TASK_STOPPED;
> current->exit_code = signr;
> sig = current->parent->sig;
> - preempt_disable();
> - current->state = TASK_STOPPED;
> if (sig && !(sig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDSTOP))
> notify_parent(current, SIGCHLD);
> schedule();
> - preempt_enable();
> continue;
> }
>
> diff -urN linux-2.5.8-pre1/include/linux/sched.h linux/include/linux/sched.h
> --- linux-2.5.8-pre1/include/linux/sched.h Wed Apr 3 20:57:27 2002
> +++ linux/include/linux/sched.h Thu Apr 4 18:29:53 2002
> @@ -91,6 +91,7 @@
> #define TASK_UNINTERRUPTIBLE 2
> #define TASK_ZOMBIE 4
> #define TASK_STOPPED 8
> +#define PREEMPT_ACTIVE 0x4000000
>
> #define __set_task_state(tsk, state_value) \
> do { (tsk)->state = (state_value); } while (0)
> diff -urN linux-2.5.8-pre1/kernel/sched.c linux/kernel/sched.c
> --- linux-2.5.8-pre1/kernel/sched.c Wed Apr 3 20:57:37 2002
> +++ linux/kernel/sched.c Thu Apr 4 18:29:24 2002
> @@ -764,6 +764,13 @@
> prev->sleep_timestamp = jiffies;
> spin_lock_irq(&rq->lock);
>
> + /*
> + * if entering from preempt_schedule, off a kernel preemption,
> + * go straight to picking the next task.
> + */
> + if (unlikely(preempt_get_count() & PREEMPT_ACTIVE))
> + goto pick_next_task;
> +
> switch (prev->state) {
> case TASK_INTERRUPTIBLE:
> if (unlikely(signal_pending(prev))) {
> @@ -775,7 +782,7 @@
> case TASK_RUNNING:
> ;
> }
> -#if CONFIG_SMP
> +#if CONFIG_SMP || CONFIG_PREEMPT
> pick_next_task:
> #endif
> if (unlikely(!rq->nr_running)) {
> @@ -843,8 +850,11 @@
> {
> if (unlikely(preempt_get_count()))
> return;

This line implies that entry.S calls with preempt count of 0. It use to
call with 1 or was that WAY back there?

If this is the way it is and it works, then the += and -= below can be
changed to = and the second reference to PREEMPT_ACTIVE becomes 0.

I think I would rather have entry.S set preempt_count to PREEMPT_ACTIVE
with the interrupt system off, turn it on, make the call directly to
schedule(), and then set to zero on return. I really am concerned with
taking an interrupt during the call, i.e. between the interrupt on and
the store below. This can lead to stack overflow rather easily.

> - current->state = TASK_RUNNING;
> +
> + current_thread_info()->preempt_count += PREEMPT_ACTIVE;
> schedule();
> + current_thread_info()->preempt_count -= PREEMPT_ACTIVE;
> + barrier();
> }
> #endif /* CONFIG_PREEMPT */
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2002-04-05 02:07:31

[permalink] [raw]

Subject: Re: [PATCH] preemptive kernel behavior change: don't be rude

On Thu, 2002-04-04 at 20:51, george anzinger wrote:

> This line implies that entry.S calls with preempt count of 0. It use to
> call with 1 or was that WAY back there?

No you are right, this line needs to be pushed back out to
preempt_enable OR we need to duplicate preempt_schedule in the entry.S
code. For now, I will just push this back into preempt_enable.

> If this is the way it is and it works, then the += and -= below can be
> changed to = and the second reference to PREEMPT_ACTIVE becomes 0.
>
> I think I would rather have entry.S set preempt_count to PREEMPT_ACTIVE
> with the interrupt system off, turn it on, make the call directly to
> schedule(), and then set to zero on return. I really am concerned with
> taking an interrupt during the call, i.e. between the interrupt on and
> the store below. This can lead to stack overflow rather easily.

Agreed. This is probably the best way to do it ... for now I'll do it
like above.

Robert Love

2002-04-05 03:09:31