Linus,
The current in_atomic() check falsely triggers with kernel preemption enabled,
since we set preempt_count to PREEMPT_ACTIVE in preempt_schedule().
We need to additionally check whether PREEMPT_ACTIVE is set.
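(For reference, the i386 definition today is

        # define in_atomic() (preempt_count() != kernel_locked())

and entering schedule() from preempt_schedule() leaves preempt_count() at
PREEMPT_ACTIVE, which can never equal kernel_locked()'s 0 or 1 - so the
check trips even though nothing is actually wrong.)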
There is also still the issue that bugging out is a bit drastic and a
hindrance to debugging; but I will tackle that later. For now, please
apply this so we can at least boot with preemption enabled.
Patch is against 2.5.35-bk.
Robert Love
diff -urN linux-2.5.35/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.35/kernel/sched.c Sun Sep 15 22:18:24 2002
+++ linux/kernel/sched.c Mon Sep 16 14:43:54 2002
@@ -940,8 +940,7 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
+ BUG_ON(in_atomic() && preempt_count() != PREEMPT_ACTIVE);
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -959,7 +958,7 @@
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+ if (unlikely(preempt_count() == PREEMPT_ACTIVE))
goto pick_next_task;
switch (prev->state) {
On 16 Sep 2002, Robert Love wrote:
>
> The current in_atomic() check falsely triggers with kernel preemption enabled,
> since we set preempt_count to PREEMPT_ACTIVE in preempt_schedule().
>
> We need to additionally check whether PREEMPT_ACTIVE is set.
Would it not be a lot better to just mask off PREEMPT_ACTIVE instead of
checking for it explicitly?
The in_interrupt() etc stuff already effectively do this by masking off
the HARDIRQ_MASK etc. I would prefer a patch to hardirq.h that just adds a
#define to make preempt_count() not contain PREEMPT_ACTIVE - and make the
PREEMPT_ACTIVE checks be a totally separate check (logic: it's not a
count, so it shouldn't show up in preempt_count())
> There is also still the issue that bugging out is a bit drastic and a
> hindrance to debugging; but I will tackle that later. For now, please
> apply this so we can at least boot with preemption enabled.
I certainly wouldn't mind the DEBUG/WARNING/FATAL infrastructure discussed
earlier..
Linus
On Mon, 2002-09-16 at 15:01, Linus Torvalds wrote:
> Would it not be a lot better to just mask off PREEMPT_ACTIVE instead of
> checking for it explicitly?
>
> The in_interrupt() etc stuff already effectively do this by masking off
> the HARDIRQ_MASK etc. I would prefer a patch to hardirq.h that just adds a
> #define to make preempt_count() not contain PREEMPT_ACTIVE - and make the
> PREEMPT_ACTIVE checks be a totally separate check (logic: it's not a
> count, so it shouldn't show up in preempt_count())
I liked this idea, and was working on implementing it when I ran into a
few roadblocks. Your ideas are welcome.
First, "preempt_count()" is used as an l-value in a lot of places, i.e.
look at all the "preempt_count() += foo" in the IRQ code. We cannot
mask things out of it.
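For example, preempt_count() expands to an l-value that the 2.5 i386
headers assign through directly:

        #define preempt_count() (current_thread_info()->preempt_count)
        #define irq_enter()     (preempt_count() += HARDIRQ_OFFSET)

Masking PREEMPT_ACTIVE out of preempt_count() itself would break every
such assignment.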
Second, I looked into doing a separate function for the raw value,
say an "atomic_count()" ... the code just looked ugly mixing
"atomic_count()" and "preempt_count()" for no apparent reason.
Third, PREEMPT_ACTIVE actually _is_ part of the count. It helps ensure
that a task is not preempted repeatedly. If we did not have it, we would
have to bump preempt_count on preemption. So we still need it in the
preempt_count().
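(Roughly, in simplified sketch form, preempt_schedule() does:

        asmlinkage void preempt_schedule(void)
        {
                struct thread_info *ti = current_thread_info();

                /* nonzero count keeps us from being preempted again mid-switch */
                ti->preempt_count = PREEMPT_ACTIVE;
                schedule();
                ti->preempt_count = 0;
        }

so the flag really is standing in for a count.)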
The simplest solution is:
#define in_atomic() \
        ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
although I still dislike the masking just to make the schedule()
code-path cleaner.
Oh, and there is another problem: printk() from schedule() implicitly
calls wake_up(). My machine dies even with just a printk() and not a
BUG()... I suspect there may be some SMP issue in that whole mess too,
because setting oops_in_progress beforehand did not help.
Comments?
Robert Love
On 16 Sep 2002, Robert Love wrote:
>
> I liked this idea, and was working on implementing it when I ran into a
> few roadblocks. Your ideas are welcome.
>
> First, "preempt_count()" is used as an l-value in a lot of places, e.g.
> look at all the "preempt_count() += foo" in the IRQ code. We cannot
> mask things out of it.
Ok. Let's just make the masking explicit in in_atomic() then, like you
suggest:
> The simplest solution is:
>
> #define in_atomic() \
>         ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
>
> although I still dislike the masking just to make the schedule()
> code-path cleaner.
I don't think this is a scheduler cleanliness issue: it's a consistency
issue. If "in_interrupt()" and friends do not care about PREEMPT_ACTIVE,
then neither should "in_atomic()". The fact that the scheduler test gets
cleaned up is secondary - although it is obviously a result of being
consistent.
> Oh, and there is another problem: printk() from schedule() implicitly
> calls wake_up(). My machine dies even with just a printk() and not a
> BUG()... I suspect there may be some SMP issue in that whole mess too,
> because setting oops_in_progress beforehand did not help.
Hmm.. It will call wake_up() because it will try to wake up any klogd.
What's the problem? Calling wake_up() should be fine from there..
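(To spell out the mechanism: once the message hits the log buffer, printk()
kicks any log readers with something like wake_up_interruptible(&log_wait) -
hence the implicit wake_up().)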
Linus
On Mon, 2002-09-16 at 17:41, Linus Torvalds wrote:
> Ok. Let's just make the masking explicit in in_atomic() then, like you
> suggest:
OK.
But, ugh, more fun: we preempt_disable() in do_exit(). Every exiting
task hits the test. My syslog is huge.
At least for now, can we please revert the check to in_interrupt()?
Robert Love
diff -urN linux-2.5.35/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.35/kernel/sched.c Sun Sep 15 22:18:24 2002
+++ linux/kernel/sched.c Mon Sep 16 18:17:06 2002
@@ -940,7 +940,7 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
+ if (unlikely(in_interrupt()))
BUG();
#if CONFIG_DEBUG_HIGHMEM
On 16 Sep 2002, Robert Love wrote:
>
> But, ugh, more fun: we preempt_disable() in do_exit(). Every exiting
> task hits the test. My syslog is huge.
>
> At least for now, can we please revert the check to in_interrupt()?
I really think the test is correct, and if we revert it now, we certainly
won't be able to re-introduce it later when we're closer to 2.6.
So if the in_atomic() change is enough to fix everything but do_exit(),
then how about just making do_exit() use PREEMPT_ACTIVE instead?
Linus
On Mon, 2002-09-16 at 18:26, Linus Torvalds wrote:
> On 16 Sep 2002, Robert Love wrote:
> >
> > At least for now, can we please revert the check to in_interrupt()?
>
> I really think the test is correct, and if we revert it now, we certainly
> won't be able to re-introduce it later when we're closer to 2.6.
>
> So if the in_atomic() change is enough to fix everything but do_exit(),
> then how about just making do_exit() use PREEMPT_ACTIVE instead?
Nope. If PREEMPT_ACTIVE is set, schedule() assumes the task is being
preempted and skips certain logic e.g. deactivate_task() (this is the
same code that lets us safely preempt a TASK_ZOMBIE).
Result is death before init even executes.
Ugh...
Robert Love
On 16 Sep 2002, Robert Love wrote:
>
> Nope. If PREEMPT_ACTIVE is set, schedule() assumes the task is being
> preempted and skips certain logic e.g. deactivate_task() (this is the
> same code that lets us safely preempt a TASK_ZOMBIE).
Ahhah! I know. You just make lock_depth 0 when you exit, without actually
taking the kernel lock. Which fools the logic into accepting a
preempt-disable, since it thinks that the preempt disable is due to
holding the kernel lock.
Add a big comment and you're all done.
Linus
On Mon, 2002-09-16 at 19:45, Linus Torvalds wrote:
> Ahhah! I know. You just make lock_depth 0 when you exit, without actually
> taking the kernel lock. Which fools the logic into accepting a
> preempt-disable, since it thinks that the preempt disable is due to
> holding the kernel lock.
I was this -> <- close to celebrating. Not so fast, smarty.
What about release_kernel_lock()?
It sees task->lock_depth>=0 and calls spin_unlock() on a lock that it
does not hold.
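For reference, the helper is roughly:

        /* sketch of the 2.5 code: drop the BKL before we switch away */
        static inline void release_kernel_lock(struct task_struct *task)
        {
                if (unlikely(task->lock_depth >= 0))
                        spin_unlock(&kernel_flag);
        }

so a faked lock_depth of 0 makes it unlock kernel_flag for real.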
Robert Love
On 16 Sep 2002, Robert Love wrote:
>
> I was this -> <- close to celebrating. Not so fast, smarty.
You forget - I'm not only a smarty, I'm sick and twisted too.
> What about release_kernel_lock()?
>
> It sees task->lock_depth>=0 and calls spin_unlock() on a lock that it
> does not hold.
We have a simple rule:
- task->lock_depth = -1 means "no lock held"
- task->lock_depth >= 0 means "BKL really held"
... but what does "task->lock_depth < -1" mean?
Yup: "validly nonpreemptable".
So then you have:
#define kernel_locked() (current->lock_depth >= 0)
#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != (current->lock_depth != -1))
and you're all set - just set lock_depth to -2 when you exit to show that
you don't hold the BKL, but that you are validly not preemptable.
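To check the arithmetic: in do_exit() after preempt_disable(), preempt_count()
is 1 and lock_depth == -2, so in_atomic() is (1 != 1) == 0 and nobody whines.
A buggy caller that schedules with a spinlock held has preempt_count() == 1
but lock_depth == -1, giving (1 != 0) == 1, so it is still caught - and a BKL
holder (lock_depth >= 0) is still allowed to schedule.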
I get the award for having the most disgusting ideas.
Linus "but it works and is efficient" Torvalds
On Tue, 2002-09-17 at 01:56, Linus Torvalds wrote:
> We have a simple rule:
> - task->lock_depth = -1 means "no lock held"
> - task->lock_depth >= 0 means "BKL really held"
>
> ... but what does "task->lock_depth < -1" mean?
>
> Yup: "validly nonpreemptable".
>
> So then you have:
>
> #define kernel_locked() (current->lock_depth >= 0)
>
> #define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != (current->lock_depth != -1))
>
> and you're all set - just set lock_depth to -2 when you exit to show that
> you don't hold the BKL, but that you are validly not preemptable.
>
> I get the award for having the most disgusting ideas.
Yah, you do. Good job though.
I implemented exactly what you detailed, with one change: we need to
check kernel_locked() before setting lock_depth because it is valid to
exit() while holding the BKL. Aside from kernel code that does it
intentionally, crashed code (e.g. modules) would have the same problem.
Anyhow, this works. Finally.
Except the printk() still causes my system to lock up, but that is
another (tomorrow's) issue.
Patch is against current BK. Thanks for the help.
Robert Love
diff -urN linux-2.5.35/include/asm-i386/hardirq.h linux/include/asm-i386/hardirq.h
--- linux-2.5.35/include/asm-i386/hardirq.h Sun Sep 15 22:18:46 2002
+++ linux/include/asm-i386/hardirq.h Tue Sep 17 03:20:00 2002
@@ -77,7 +77,8 @@
#define irq_enter() (preempt_count() += HARDIRQ_OFFSET)
#if CONFIG_PREEMPT
-# define in_atomic() (preempt_count() != kernel_locked())
+# define in_atomic() \
+ ((preempt_count() & ~PREEMPT_ACTIVE) != (current->lock_depth != -1))
# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
#else
# define in_atomic() (preempt_count() != 0)
diff -urN linux-2.5.35/kernel/exit.c linux/kernel/exit.c
--- linux-2.5.35/kernel/exit.c Tue Sep 17 03:18:53 2002
+++ linux/kernel/exit.c Tue Sep 17 03:55:12 2002
@@ -637,6 +637,21 @@
preempt_disable();
if (current->exit_signal == -1)
release_task(current);
+
+ /*
+ * This little bit of genius comes from the twisted mind of Linus.
+ * We need exit() to be atomic but we also want a debugging check
+ * in schedule() to whine if we are atomic. The wickedness is in
+ * these rules:
+ * - task->lock_depth = -2 means "validly nonpreemptable"
+ * - task->lock_depth = -1 means "BKL not held"
+ * - task->lock_depth >= 0 means "BKL held"
+ * release_kernel_lock and kernel_locked() check >=0, and
+ * in_atomic() checks != -1... the "fake BKL" will "cancel out"
+ * the preempt_disable() above and everyone is happy.
+ */
+ if (unlikely(!kernel_locked()))
+ tsk->lock_depth = -2;
schedule();
BUG();
/*
diff -urN linux-2.5.35/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.35/kernel/sched.c Sun Sep 15 22:18:24 2002
+++ linux/kernel/sched.c Tue Sep 17 03:50:04 2002
@@ -940,8 +940,10 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
+ if (unlikely(in_atomic())) {
+ printk(KERN_ERR "error: scheduling while atomic!\n");
+ dump_stack();
+ }
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -959,7 +961,7 @@
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+ if (unlikely(preempt_count() == PREEMPT_ACTIVE))
goto pick_next_task;
switch (prev->state) {
On Tue, 2002-09-17 at 04:12, Robert Love wrote:
> + if (unlikely(!kernel_locked()))
> + tsk->lock_depth = -2;
Eh, this should be likely()...
Robert Love
diff -urN linux-2.5.35/include/asm-i386/hardirq.h linux/include/asm-i386/hardirq.h
--- linux-2.5.35/include/asm-i386/hardirq.h Sun Sep 15 22:18:46 2002
+++ linux/include/asm-i386/hardirq.h Tue Sep 17 03:20:00 2002
@@ -77,7 +77,8 @@
#define irq_enter() (preempt_count() += HARDIRQ_OFFSET)
#if CONFIG_PREEMPT
-# define in_atomic() (preempt_count() != kernel_locked())
+# define in_atomic() \
+ ((preempt_count() & ~PREEMPT_ACTIVE) != (current->lock_depth != -1))
# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
#else
# define in_atomic() (preempt_count() != 0)
diff -urN linux-2.5.35/kernel/exit.c linux/kernel/exit.c
--- linux-2.5.35/kernel/exit.c Tue Sep 17 03:18:53 2002
+++ linux/kernel/exit.c Tue Sep 17 03:55:12 2002
@@ -637,6 +637,21 @@
preempt_disable();
if (current->exit_signal == -1)
release_task(current);
+
+ /*
+ * This little bit of genius comes from the twisted mind of Linus.
+ * We need exit() to be atomic but we also want a debugging check
+ * in schedule() to whine if we are atomic. The wickedness is in
+ * these rules:
+ * - task->lock_depth = -2 means "validly nonpreemptable"
+ * - task->lock_depth = -1 means "BKL not held"
+ * - task->lock_depth >= 0 means "BKL held"
+ * release_kernel_lock and kernel_locked() check >=0, and
+ * in_atomic() checks != -1... the "fake BKL" will "cancel out"
+ * the preempt_disable() above and the world is happy.
+ */
+ if (likely(!kernel_locked()))
+ tsk->lock_depth = -2;
schedule();
BUG();
/*
diff -urN linux-2.5.35/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.35/kernel/sched.c Sun Sep 15 22:18:24 2002
+++ linux/kernel/sched.c Tue Sep 17 03:50:04 2002
@@ -940,8 +940,10 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
+ if (unlikely(in_atomic())) {
+ printk(KERN_ERR "error: scheduling while atomic!\n");
+ dump_stack();
+ }
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -959,7 +961,7 @@
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+ if (unlikely(preempt_count() == PREEMPT_ACTIVE))
goto pick_next_task;
switch (prev->state) {
On Tue, 2002-09-17 at 04:12, Robert Love wrote:
> I implemented exactly what you detailed, with one change: we need to
> check kernel_locked() before setting lock_depth because it is valid to
> exit() while holding the BKL. Aside from kernel code that does it
> intentionally, crashed code (e.g. modules) would have the same problem.
fsck, this is never-ending... obviously, this is insufficient:
preempt_count will equal two if the BKL is held on entry (its
preempt_disable() plus ours) and in_atomic() will trip.
This should work:
        if (likely(!kernel_locked()))
                tsk->lock_depth = -2;
        else {
                /* compensate for BKL; we still cannot preempt */
                preempt_enable_no_resched();
        }
look sane?
Now, remind me why this is all worth it...
Robert Love
On 17 Sep 2002, Robert Love wrote:
[...]
> Now, remind me why this is all worth it...
having preemption support that 1) is correct 2) works?
We *must* use the schedule() check to debug preemption bugs, or we won't
have usable preemption in 2.6; i don't really understand why you are not
happy that we have such a great tool. In fact we should also add other
debugging bits, like 'check for !0 preemption count in smp_processor_id()',
and the underflow checks that caught the IDE bug. These are all bits
that help the elimination of preemption bugs which are also often SMP
bugs, on plain UP boxes.
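(The smp_processor_id() bit could look something like this - a sketch only,
not an actual patch:

        /* whine if a preemptible task asks which CPU it is running on */
        #define smp_processor_id()                                      \
        ({                                                              \
                if (unlikely(preempt_count() == 0))                     \
                        printk(KERN_ERR "smp_processor_id() in preemptible code\n"); \
                current_thread_info()->cpu;                             \
        })

since the answer can change under a preemptible caller.)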
Ingo
On Tue, 2002-09-17 at 02:59, Robert Love wrote:
> On Tue, 2002-09-17 at 04:12, Robert Love wrote:
>
> > I implemented exactly what you detailed, with one change: we need to
> > check kernel_locked() before setting lock_depth because it is valid to
> > exit() while holding the BKL. Aside from kernel code that does it
> > intentionally, crashed code (e.g. modules) would have the same problem.
>
> fsck, this is never-ending... obviously, this is insufficient:
> preempt_count will equal two if the BKL is held on entry (its
> preempt_disable() plus ours) and in_atomic() will trip.
>
> This should work:
>
>         if (likely(!kernel_locked()))
>                 tsk->lock_depth = -2;
>         else {
>                 /* compensate for BKL; we still cannot preempt */
>                 preempt_enable_no_resched();
>         }
>
Robert,
I applied the modified patch you sent at 04:51 plus the above and
had to change dump_stack() to show_stack(NULL) in sched.c due to
kernel/kernel.o: In function `schedule':
kernel/kernel.o(.text+0xf13): undefined reference to `dump_stack'
I used plain vanilla 2.5.35 to patch against.
I booted that so-patched kernel and it got much further than before,
up to where syslogd was able to write some stuff to /var/log/messages.
There were many instances of "error: scheduling while atomic!",
and the traces were similar. Here is a decoded one:
[steven@spc5 linux-2.5.35]$ ksymoops -K -L -O -v vmlinux -m System.map <non-atomic.txt
ksymoops 2.4.4 on i686 2.5.35. Options used
-v vmlinux (specified)
-K (specified)
-L (specified)
-O (specified)
-m System.map (specified)
f7739f50 c0282f00 f7771960 c0337f20 c011c51b c1b0eec0 c1b71ba0 f7738000
f7738000 f7738000 f78d4da0 c011d08b f78eb380 c012d67f f78d01c0 f78eb380
f78eb560 f78d01c0 f78d01dc 40015000 bffffc48 c012d6c5 4213030c 00000004
Call Trace: [<c011c51b>] [<c011d08b>] [<c012d67f>] [<c012d6c5>] [<c010918f>]
Warning (Oops_read): Code line not seen, dumping what data is available
Trace; c011c51b <put_files_struct+bb/d0>
Trace; c011d08b <do_exit+35b/370>
Trace; c012d67f <do_munmap+11f/130>
Trace; c012d6c5 <sys_munmap+35/60>
Trace; c010918f <syscall_call+7/b>
Hope this helps,
Steven
On 17 Sep 2002, Robert Love wrote:
>
> Now, remind me why this is all worth it...
I really think we need this to have a stable system. Alan still thinks
preempt is unstable, and this helps counter some of that - by adding
sanity checks that all the counters are doing the right thing.
Also, it's one more reason to support preemption in the first place.
Having lower latency is all fine, but not everybody cares and clearly
preemption adds its own overhead. Having the preemption infrastructure add
its own set of help (ie helping find not only preempt-related bugs, but
spinlock bugs too) makes preempt all the more useful.
In particular, with preempt we can add atomicity debugging to "put_user()"
and friends, to verify that nobody tries to do user-mode accesses when
they aren't allowed to - another use for "in_atomic()". But that
requires a functioning in_atomic() that isn't too limited to be generally
used..
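Something like this, purely as a sketch of the idea:

        /* sketch: catch user-space access from atomic context */
        #define put_user(x, ptr)                                        \
        ({                                                              \
                if (unlikely(in_atomic()))                              \
                        printk(KERN_ERR "put_user() in atomic context\n"); \
                __put_user(x, ptr);                                     \
        })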
Linus
On Tue, 17 Sep 2002, Linus Torvalds wrote:
>
> I really think we need this to have a stable system. Alan still thinks
> preempt is unstable, and this helps counter some of that - by adding
> sanity checks that all the counters are doing the right thing.
That said, the exit() crap does end up being a bit unreadable. How about
just splitting up schedule() into the "normal schedule" and the "exit()"
case. In particular, there are a few other things that are illegal for
the non-exit case anyway, and that we might as well check for. There are
also potentially things we might want to do in the exit case that we don't
want to do in the hot-path.
So we could just rename the current core "schedule()" functionality as
"static void do_schedule()", and then have
void schedule(void)
{
        BUG_ON(in_atomic());
        BUG_ON(current->state & TASK_ZOMBIE);
        do_schedule();
}

void exit_schedule(void)
{
        BUG_ON(!(current->state & TASK_ZOMBIE));
        do_schedule();
}
which is simpler than playing games with lock_depth == -2, and also
allows us more sanity checking than we have right now (the TASK_ZOMBIE
thing is also a exit-specific thing, and there may well be others too).
Linus
On Tue, 17 Sep 2002, Linus Torvalds wrote:
> void schedule(void)
> {
>         BUG_ON(in_atomic());
>         BUG_ON(current->state & TASK_ZOMBIE);
>         do_schedule();
> }
>
> void exit_schedule(void)
> {
>         BUG_ON(!(current->state & TASK_ZOMBIE));
>         do_schedule();
> }
i'd rather do it the other way around - ie. do a:
        if (unlikely(current->state == TASK_ZOMBIE)) {
                ... exit checks ...
        } else {
                ... normal checks ...
        }
this test is cheaper than a function call. I'd say the ZOMBIE check alone
is not significant enough to introduce a function split in the hotpath.
This check is also easier to remove later on.
TASK_ZOMBIE is only ever allowed to be set for exiting tasks. [and if
TASK_ZOMBIE handling is broken then it will trigger the normal atomic
tests so it's not like we are silently ignoring those errors.]
Ingo
ie. for the time being, something like:
--- linux/kernel/sched.c.orig Tue Sep 17 17:53:31 2002
+++ linux/kernel/sched.c Tue Sep 17 17:54:10 2002
@@ -940,8 +940,9 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
+ if (likely(current->state != TASK_ZOMBIE))
+ if (unlikely(in_atomic()))
+ BUG();
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
On Tue, 17 Sep 2002, Linus Torvalds wrote:
> On the other hand, we do have other ways to test the preempt count
> inside the scheduler. In particular, we might just move the
> "in_atomic()" check a few lines downwards, at which point we've released
> the kernel lock and explicitly disabled preemption, so at that point the
> test should be even simpler with fewer conditionals..
indeed ...
Ingo
On Tue, 17 Sep 2002, Ingo Molnar wrote:
>
> this test is cheaper than a function call.
Ehh.. On most architectures an unconditional direct function call is
_faster_ than a conditional test that does not predict all that well. So
the "test is faster than a function call" is not a very good optimization,
I suspect.
In fact, we'd probably be better off trying to factor out more out of
schedule() rather than making it even bigger. We've done that before: the
timeout stuff was done that way, and meant that normal reschedules no
longer need to care about timeouts. That made things simpler.
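(schedule_timeout() being the case in point - simplified here, ignoring the
MAX_SCHEDULE_TIMEOUT and negative-timeout handling:

        signed long schedule_timeout(signed long timeout)
        {
                struct timer_list timer;
                unsigned long expire = timeout + jiffies;

                init_timer(&timer);
                timer.expires = expire;
                timer.data = (unsigned long) current;
                timer.function = process_timeout;

                add_timer(&timer);
                schedule();
                del_timer_sync(&timer);

                timeout = expire - jiffies;
                return timeout < 0 ? 0 : timeout;
        }

so schedule() itself never has to know about timeouts.)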
On the other hand, we do have other ways to test the preempt count inside
the scheduler. In particular, we might just move the "in_atomic()" check a
few lines downwards, at which point we've released the kernel lock and
explicitly disabled preemption, so at that point the test should be even
simpler with fewer conditionals..
Linus
On Tue, 2002-09-17 at 05:57, Ingo Molnar wrote:
> We *must* use the schedule() check to debug preemption bugs, or we won't
> have usable preemption in 2.6; i don't really understand why you are not
> happy that we have such a great tool. In fact we should also add other
> debugging bits, like 'check for !0 preemption count in smp_processor_id()',
> and the underflow checks that caught the IDE bug. These are all bits
> that help the elimination of preemption bugs which are also often SMP
> bugs, on plain UP boxes.
Hm, sorry if I sound like I do not want something "so great". I do. I
just do not ever want to compromise the existing code... I would much
prefer to say "wow we cannot do this cleanly now, let's wait until we
figure out a clean way".
Anyhow, one of us is confused. How can this in_atomic() test _ever_
catch a preemption bug? We cannot enter the scheduler off kernel
preemption unless preempt_count==0. This is a test to catch bugs in
other parts of the kernel, e.g. where code explicitly calls schedule()
while holding a lock.
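E.g. the classic pattern it catches is (made-up lock for illustration):

        spin_lock(&some_lock);          /* preempt_count() now nonzero */
        if (error)
                schedule();             /* bug: sleeping with a lock held */
        spin_unlock(&some_lock);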
Robert Love
On Tue, 2002-09-17 at 10:10, Steven Cole wrote:
> I booted that so-patched kernel and it got much further than before,
> up to where syslogd was able to write some stuff to /var/log/messages.
That is what is happening to me... if you trace when you lock up, it's due
to the printk. My machines in these cases are only livelocked, I can
still ping etc.
> Trace; c011c51b <put_files_struct+bb/d0>
> Trace; c011d08b <do_exit+35b/370>
> Trace; c012d67f <do_munmap+11f/130>
> Trace; c012d6c5 <sys_munmap+35/60>
> Trace; c010918f <syscall_call+7/b>
Ugh, this looks like the exit path, which should not be triggered.
Robert Love
On 17 Sep 2002, Robert Love wrote:
> [...] How can this in_atomic() test _ever_ catch a preemption bug? We
> cannot enter the scheduler off kernel preemption unless
> preempt_count==0. This is a test to catch bugs in other parts of the
> kernel, e.g. where code explicitly calls schedule() while holding a
> lock.
you are right, i was confusing this with the older check for disabled
interrupts in preempt_schedule() [which i'd still find useful].
The smp_processor_id() test catches true preemption bugs. So does
preempt_count() underflow detection.
i do agree with Alan - there can be nothing bad in trying to fix all that
non-preempt-aware code right now, before it becomes too late.
Ingo
On 17 Sep 2002, Robert Love wrote:
> OK so do we want to do (a):
>
> (moved down to after the preempt_disable() and release_kernel_lock())
>
>         if (likely(current->state != TASK_ZOMBIE))
>                 if (unlikely((preempt_count() & ~PREEMPT_ACTIVE) != 1))
>                         ...
>
> or go with (b) where we split schedule() into schedule(),
> exit_schedule(), and do_schedule().
i'd do (a). current->state is to be used anyway, and the default-untaken
first branch should be cheap. Plus by moving things down the splitup of
the function would create more code duplication than necessary i think.
Ingo
On Tue, 2002-09-17 at 12:29, Robert Love wrote:
> On Tue, 2002-09-17 at 10:10, Steven Cole wrote:
>
> > I booted that so-patched kernel and it got much further than before,
> > up to where syslogd was able to write some stuff to /var/log/messages.
>
> That is what is happening to me... if you trace when you lock up, it's due
> to the printk. My machines in these cases are only livelocked, I can
> still ping etc.
>
> > Trace; c011c51b <put_files_struct+bb/d0>
> > Trace; c011d08b <do_exit+35b/370>
> > Trace; c012d67f <do_munmap+11f/130>
> > Trace; c012d6c5 <sys_munmap+35/60>
> > Trace; c010918f <syscall_call+7/b>
>
> Ugh, this looks like the exit path, which should not be triggered.
>
I grabbed the 2.5.35-bk3 patch provided by Jeff Garzik so that I could
test something even more current and applied your patches to that (one
hunk was already there in -bk3).
I ran several of the stack dumps through ksymoops. Perhaps some of
these might point to something interesting.
Steven
ksymoops 2.4.4 on i686 2.5.35. Options used
-v vmlinux (specified)
-K (specified)
-L (specified)
-O (specified)
-m System.map (specified)
f773df4c c01167a6 c02834a0 f77801c0 c0337f20 c011c4eb c1b0eec0 f7936c60
f773c000 f773c000 f773c000 f77021a0 c011d05b 00000000 0000a9db f77021a0
c0122340 f7744760 f78cd2e0 f7744760 f773c000 bffffd08 c0122522 4213030c
Call Trace: [<c01167a6>] [<c011c4eb>] [<c011d05b>] [<c0122340>] [<c0122522>]
[<c010918f>]
f773df4c c01167a6 c02834a0 f77801c0 c0337f20 c011c4eb c1b0eec0 f7936ee0
f773c000 f773c000 f773c000 f77021a0 c011d05b c022305b 00000003 bfffea00
0000001c 00000000 0804b3c8 00000014 00000003 bfffea00 0000001c 4213030c
Call Trace: [<c01167a6>] [<c011c4eb>] [<c011d05b>] [<c022305b>] [<c010918f>]
f7739f4c c01167a6 c02834a0 f77801c0 c0337f20 c011c4eb c1b0eec0 f7936820
f7738000 f7738000 f7738000 f772a800 c011d05b 00000000 f7738000 bffff7a4
f7739fb0 0807e210 04000000 42029098 00000000 00000000 00000000 4213030c
Call Trace: [<c01167a6>] [<c011c4eb>] [<c011d05b>] [<c010918f>]
f7789f4c c01167a6 c02834a0 f7c5e760 c0337f20 c011c4eb c1b0eec0 c1b71da0
f7788000 f7788000 f7788000 f7ea8d60 c011d05b 0000006e 00000000 00000005
c0222fb0 00000005 f77fa1a0 fffffff7 00000005 bffffb28 c0142fe3 4213030c
Call Trace: [<c01167a6>] [<c011c4eb>] [<c011d05b>] [<c0222fb0>] [<c0142fe3>]
[<c010918f>]
Warning (Oops_read): Code line not seen, dumping what data is available
Trace; c01167a6 <schedule+36/3d0>
Trace; c011c4eb <put_files_struct+bb/d0>
Trace; c011d05b <do_exit+35b/370>
Trace; c0122340 <process_timeout+0/10>
Trace; c0122522 <sys_nanosleep+122/1a0>
Trace; c010918f <syscall_call+7/b>
Trace; c01167a6 <schedule+36/3d0>
Trace; c011c4eb <put_files_struct+bb/d0>
Trace; c011d05b <do_exit+35b/370>
Trace; c022305b <sys_socketcall+13b/200>
Trace; c010918f <syscall_call+7/b>
Trace; c01167a6 <schedule+36/3d0>
Trace; c011c4eb <put_files_struct+bb/d0>
Trace; c011d05b <do_exit+35b/370>
Trace; c010918f <syscall_call+7/b>
Trace; c01167a6 <schedule+36/3d0>
Trace; c011c4eb <put_files_struct+bb/d0>
Trace; c011d05b <do_exit+35b/370>
Trace; c0222fb0 <sys_socketcall+90/200>
Trace; c0142fe3 <sys_write+33/40>
Trace; c010918f <syscall_call+7/b>
1 warning issued. Results may not be reliable.
On Tue, 2002-09-17 at 12:26, Ingo Molnar wrote:
> On Tue, 17 Sep 2002, Linus Torvalds wrote:
>
> > On the other hand, we do have other ways to test the preempt count
> > inside the scheduler. In particular, we might just move the
> > "in_atomic()" check a few lines downwards, at which point we've released
> > the kernel lock and explicitly disabled preemption, so at that point the
> > test should be even simpler with fewer conditionals..
>
> indeed ...
OK so do we want to do (a):
(moved down to after the preempt_disable() and release_kernel_lock())
        if (likely(current->state != TASK_ZOMBIE))
                if (unlikely((preempt_count() & ~PREEMPT_ACTIVE) != 1))
                        ...
or go with (b) where we split schedule() into schedule(),
exit_schedule(), and do_schedule().
Robert Love
On Tue, 2002-09-17 at 14:57, Ingo Molnar wrote:
> i'd do (a). current->state is to be used anyway, and the default-untaken
> first branch should be cheap. Plus by moving things down the splitup of
> the function would create more code duplication than necessary i think.
Note by moving it down, the only gain over keeping it at the top is not
having to check for the BKL...
Anyhow, I would appreciate it if you could give this a try (with kernel
preemption enabled)... any comments are appreciated.
(Note you need a 2.5.35-bk release to get the dump_stack(). Otherwise
use show_trace(0).)
Robert Love
diff -urN linux-2.5.35/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.35/kernel/sched.c Sun Sep 15 22:18:24 2002
+++ linux/kernel/sched.c Tue Sep 17 15:24:08 2002
@@ -940,9 +940,6 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
-
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
#endif
@@ -950,8 +947,20 @@
preempt_disable();
prev = current;
rq = this_rq();
-
release_kernel_lock(prev);
+
+ /*
+ * Test if we are atomic. Since do_exit() needs to call into
+ * schedule() atomically, we ignore that for now. Otherwise,
+ * whine if we are scheduling when we should not be.
+ */
+ if (likely(current->state != TASK_ZOMBIE)) {
+ if (unlikely((preempt_count() & ~PREEMPT_ACTIVE) != 1)) {
+ printk(KERN_ERR "scheduling while atomic!\n");
+ dump_stack();
+ }
+ }
+
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);
@@ -959,7 +968,7 @@
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+ if (unlikely(preempt_count() == PREEMPT_ACTIVE))
goto pick_next_task;
switch (prev->state) {
On Tue, 2002-09-17 at 13:23, Robert Love wrote:
> On Tue, 2002-09-17 at 14:57, Ingo Molnar wrote:
>
> > i'd do (a). current->state is to be used anyway, and the default-untaken
> > first branch should be cheap. Plus by moving things down the splitup of
> > the function would create more code duplication than necessery i think.
>
> Note by moving it down, the only gain over keeping it at the top is not
> having to check for the BKL...
>
> Anyhow, I would appreciate it if you could give this a try (with kernel
> preemption enabled)... any comments are appreciated.
>
> (Note you need a 2.5.35-bk release to get the dump_stack(). Otherwise
> use show_trace(0).)
>
> Robert Love
>
I applied that patch to 2.5.35-bk3 with PREEMPT enabled, and it booted
without any of the usual complaints from the preempt/in_atomic check.
But then I ran:
1) dbench 1 OK
2) dbench 2 OK
3) dbench 3 blam!
Running dbench 3 resulted in the dbench clients hanging and being
unkillable with kill -9 in the D state.
steven 1046 0.0 0.0 1440 472 ? D 13:46 0:00 ./dbench 3
steven 1047 0.0 0.0 1440 420 ? D 13:46 0:00 ./dbench 3
steven 1048 0.0 0.0 1440 472 ? D 13:46 0:00 ./dbench 3
I can ssh and enter my user password, but it hangs after that with
no bash prompt. Other ssh sessions which I started previously are still
responsive.
Test box is 2-way pIII, kernel SMP.
Steven
On Tue, 2002-09-17 at 15:54, Steven Cole wrote:
Thank you for the testing, Steven.
> Running dbench 3 resulted in the dbench clients hanging and being
> unkillable with kill -9 in the D state.
Hrm, I cannot reproduce this. I just successfully completed a `dbench
16'. Can you find where they are hanging? You can get a trace via
sysrq. You can also see where they are in the kernel via the wchan
field of ps: "ps -ewo user,pid,priority,%cpu,stat,command,wchan" is a
favorite of mine.
Sure it does not happen with a stock kernel (no preempt)?
What if you replace the printk() and dump_stack() in schedule() with a
no-op (but not something that will optimize away the conditional, e.g.
try a cpu_relax())?
Oh, is the previous patch fully backed out? None of that do_exit muck
anymore, right?
> Test box is 2-way pIII, kernel SMP.
I too am SMP with kernel preemption, dual Athlon MP.
Regards,
Robert Love
On Tue, 2002-09-17 at 14:06, Robert Love wrote:
> On Tue, 2002-09-17 at 15:54, Steven Cole wrote:
>
> Thank you for the testing, Steven.
>
> > Running dbench 3 resulted in the dbench clients hanging and being
> > unkillable with kill -9 in the D state.
>
> Hrm, I cannot reproduce this. I just successfully completed a `dbench
> 16'. Can you find where they are hanging? You can get a trace via
> sysrq. You can also see where they are in the kernel via the wchan
> field of ps: "ps -ewo user,pid,priority,%cpu,stat,command,wchan" is a
> favorite of mine.
I rebooted the box, but had to SYSRQ-S, SYSRQ-B, due to shutdown -r now
hanging up. On reboot, the init scripts froze after "starting atd", but
fortunately this was after sshd, so I ssh'd to the box several times,
and did your ps command before hanging it with dbench 3 (dbench 1 and 2
were successful again). I have that ps output if you want it. I could
NOT do that, or even ls, after the dbench 3 hang. When I shut down,
SYSRQ-S only partially synced (just one partition).
>
> Sure it does not happen with a stock kernel (no preempt)?
I'll try 2.5.35-bk3 (no other patches) without preempt shortly.
I'll report any failures, but no news is good news.
>
> What if you replace the printk() and dump_stack() in schedule() with a
> no-op (but not something that will optimize away the conditional, e.g.
> try a cpu_relax())?
I'll try that if I get the time. But I wasn't getting any dump info in
dmesg for those boots.
>
> Oh, is the previous patch fully backed out? None of that do_exit muck
> anymore, right?
Yep, plain 2.5.35, then patched with 2.5.35-bk3, then with your patch
posted at 15:23. Nothing else.
>
> > Test box is 2-way pIII, kernel SMP.
>
> I too am SMP with kernel preemption, dual Athlon MP.
Steven
On Tue, 2002-09-17 at 14:06, Robert Love wrote:
> On Tue, 2002-09-17 at 15:54, Steven Cole wrote:
>
> Thank you for the testing, Steven.
>
> > Running dbench 3 resulted in the dbench clients hanging and being
> > unkillable with kill -9 in the D state.
>
> Hrm, I cannot reproduce this. I just successfully completed a `dbench
> 16'. Can you find where they are hanging? You can get a trace via
> sysrq. You can also see where they are in the kernel via the wchan
> field of ps: "ps -ewo user,pid,priority,%cpu,stat,command,wchan" is a
> favorite of mine.
Sorry, it hung so badly that it didn't respond to that.
>
> Sure it does not happen with a stock kernel (no preempt)?
I just began testing plain vanilla 2.5.35-bk3 without preempt, and the
box has run up to 24 clients so far. So far so good.
>
> What if you replace the printk() and dump_stack() in schedule() with a
> no-op (but not something that will optimize away the conditional, e.g.
> try a cpu_relax())?
I'll try that in a bit.
Steven
On Tue, 2002-09-17 at 16:58, Steven Cole wrote:
> Sorry, it hung so badly that it didn't respond to that.
I fixed the hang. If you notice the problem, please do not laugh.
The attached patch, against 2.5.36, should work fine...
Robert Love
diff -urN linux-2.5.36/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.36/kernel/sched.c Tue Sep 17 20:58:48 2002
+++ linux/kernel/sched.c Wed Sep 18 00:41:09 2002
@@ -940,9 +940,6 @@
struct list_head *queue;
int idx;
- if (unlikely(in_atomic()))
- BUG();
-
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
#endif
@@ -950,8 +947,20 @@
preempt_disable();
prev = current;
rq = this_rq();
-
release_kernel_lock(prev);
+
+ /*
+ * Test if we are atomic. Since do_exit() needs to call into
+ * schedule() atomically, we ignore that for now. Otherwise,
+ * whine if we are scheduling when we should not be.
+ */
+ if (likely(current->state != TASK_ZOMBIE)) {
+ if (unlikely((preempt_count() & ~PREEMPT_ACTIVE) != 1)) {
+ printk(KERN_ERR "scheduling while atomic!\n");
+ dump_stack();
+ }
+ }
+
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);
On Tue, 2002-09-17 at 22:44, Robert Love wrote:
> On Tue, 2002-09-17 at 16:58, Steven Cole wrote:
>
> > Sorry, it hung so badly that it didn't respond to that.
>
> I fixed the hang. If you notice the problem, please do not laugh.
>
> The attached patch, against 2.5.36, should work fine...
>
> Robert Love
Thanks. That works fine here. Now that I can run with PREEMPT again,
I'll leave it enabled. It does seem to improve interactive feel under
heavy load, but without an easily identifiable metric for that, you know
what Rik would say.
Steven