2005-10-12 17:00:40

by Mark Knecht

[permalink] [raw]
Subject: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot

Config file attached. The only change was to enable IRQ-off latency
tracing using make menuconfig. Rebuild and reboot. I got a message
about the kernel not syncing, lots of stuff above that message about
do_futex, etc.

This was not a problem on 2.6.14-rc3-rt13.

Thanks,
Mark


Attachments:
(No filename) (277.00 B)
knecht.config-2.6.14-rc4-rt1.bz2 (6.23 kB)
Download all attachments

2005-10-13 07:29:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot


* Mark Knecht <[email protected]> wrote:

> Config file attached. The only change was to enable IRQ-off latency
> tracing using make menuconfig. Rebuild and reboot. I got a message
> about the kernel not syncing, lots of stuff above that message about
> do_futex, etc.

i cannot reproduce your problems - your .config works fine on my x64
box. A log of the crash would be needed - do you have a null-modem cable
to connect this box to some other nearby box to do serial logging? If
yes then there is a mini-howto below. (for x86, but it works the same
for x64)

Ingo


to set up serial logging:
-------------------------

install a null modem cable (== serial cable) to one of the serial ports
of the server, connect the cable to another box, run a terminal program
on that other box (e.g. "minicom -m" - do Alt-L to switch on logging
after starting it up) and set up the server's kernel to do serial
logging: enable CONFIG_SERIAL_8250_CONSOLE and
CONFIG_SERIAL_CORE_CONSOLE, recompile & reinstall the kernel, add
"console=ttyS0,38400 console=tty0" to your /etc/grub.conf or
/etc/lilo.conf kernel boot line, reboot the server with the new kernel
command line - and configure minicom to run with that speed (Alt-S).

e.g. my /etc/grub.conf has:

title test-2.6 (test-2.6)
root (hd0,0)
kernel /boot/bzImage root=/dev/sda1 console=ttyS0,38400 console=tty0 nmi_watchdog=1 kernel_preempt=1

if everything is set up correctly then you should see kernel messages
showing up in the minicom session when you boot up.

When the messages do not show up then typical errors are mismatch
between the serial port (or speed) and the device names used - if it's
COM2 then use ttyS1, and dont forget to set up the serial speed option
of minicom, etc. You can test the serial connection by doing:

echo x > /dev/ttyS0

and that should show up in the minicom session on the other box.

to set up early-printk:
-----------------------

occasionally lockups/crashes happen so early in the bootup that nothing
makes it even to the serial log. In that case the 'earlyprintk' feature
is most useful. It is default-enabled on all 2.6 kernels, you only need
to add one more boot parameter to activate it over the serial console:

earlyprintk=serial,ttyS0,38400

2005-10-13 12:26:19

by Mark Knecht

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot

On 10/13/05, Ingo Molnar <[email protected]> wrote:
>
> * Mark Knecht <[email protected]> wrote:
>
> > Config file attached. The only change was to enable IRQ-off latency
> > tracing using make menuconfig. Rebuild and reboot. I got a message
> > about the kernel not syncing, lots of stuff above that message about
> > do_futex, etc.
>
> i cannot reproduce your problems - your .config works fine on my x64
> box. A log of the crash would be needed - do you have a null-modem cable
> to connect this box to some other nearby box to do serial logging? If
> yes then there is a mini-howto below. (for x86, but it works the same
> for x64)
>
> Ingo
>
>
Guitar player doesn't do serial ports? ;-)

OK, I've never done anything like this before, but I'm motivated so
I'll give it a shot. Hopefully I can make some headway without having
to ask too many stupid questions, such as the one that follows:

Question: Is a 'null modem' cable just a plain serial cable, or is it
a special serial cable I need to go buy?

Cheers,
Mark

2005-10-13 12:41:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot


On Thu, 13 Oct 2005, Mark Knecht wrote:

> Guitar player doesn't do serial ports? ;-)
>
> OK, I've never done anything like this before, but I'm motivated so
> I'll give it a shot. Hopefully I can make some headway without having
> to ask too many stupid questions, such as the one that follows:
>
> Question: Is a 'null modem' cable just a plain serial cable, or is it
> a special serial cable I need to go buy?
>

There's two types of "plain serial cables". A 'null modem' and a 'non
null modem'. The 'null modem' cable allows for two computers connected
together by their serial ports to talk to each other. This is because the
'null modem' cable crosses the wires so input connects to output and so
forth. A 'non-null modem' cable is a direct straight through. So you
can connect multiple 'non-null' modem cables together and it will still be
a non-null modem cable. In fact you can connect one or more non-null to a
null modem cable and it will still be a null modem cable (just longer).
So make sure that you get a 'null modem' cable.

Make sure you have the CONFIG_SERIAL_CONSOLE (or
CONFIG_SERIAL_8250_CONSOLE) set. And then read
Documentation/serial-console.txt


-- Steve

2005-10-13 12:47:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot


On Thu, 13 Oct 2005, Mark Knecht wrote:

>
> Question: Is a 'null modem' cable just a plain serial cable, or is it
> a special serial cable I need to go buy?
>

Also make sure that you get a cable with two female ends. If the computer
guy has half a clue, asking for a 'null modem cable' with two female ends
to connect two computers together. Should be enough info for him to get
you the right cable. Although he might ask "9 pin or 25?". Which you
would say 9. :-)

-- Steve

2005-10-13 19:10:33

by Mark Knecht

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot

On 10/13/05, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 13 Oct 2005, Mark Knecht wrote:
>
> >
> > Question: Is a 'null modem' cable just a plain serial cable, or is it
> > a special serial cable I need to go buy?
> >
>
> Also make sure that you get a cable with two female ends. If the computer
> guy has half a clue, asking for a 'null modem cable' with two female ends
> to connect two computers together. Should be enough info for him to get
> you the right cable. Although he might ask "9 pin or 25?". Which you
> would say 9. :-)
>
> -- Steve
>
Ingo & Steve,
Thank you for your great instructions that even a guitar player
could basically follow. After about an hour of messing around I did
manage to capture the crash. The console file is attached.

NOTE: The first time I booted the kernel it got to the crash point and
the machine rebooted. The second time it booted I got the trace. Both
boots are in the capture file.

Hope this helps!

Cheers,
Mark


Attachments:
(No filename) (981.00 B)
minicom.cap.bz2 (10.62 kB)
Download all attachments

2005-10-14 03:52:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot


* Mark Knecht <[email protected]> wrote:

> Ingo & Steve,
> Thank you for your great instructions that even a guitar player
> could basically follow. After about an hour of messing around I did
> manage to capture the crash. The console file is attached.
>
> NOTE: The first time I booted the kernel it got to the crash point and
> the machine rebooted. The second time it booted I got the trace. Both
> boots are in the capture file.

thanks, this log is much more informative. No smoking gun though, but it
seems something fundamental (probably lowlevel x64 code) has been broken
by -rt1.

Do the crashes go away if you take the -rc3-rt13 version of
arch/x86_64/kernel/entry.S and copy it over into the -rc4-rt1 tree?
[this undoes a particular set of CONFIG_CRITICAL_IRQSOFF_TIMING fixes
from the x64 code, which i did during -rc3-rt13 => -rc4-rt1]

(Note that doing this will re-introduce tracing bugs, which can result
in false-positive latency readings - but it should fix any related
lowlevel bug in the assembly code.)

if this indeed solves the crash then i'd suggest to restore the -rt1
version of entry.S, and i'd suggest to disable CRITICAL_IRQSOFF_TIMING
until i fix it. You should be able to get pretty good latency tracing
info even without CRITICAL_IRQSOFF_TIMING.

Ingo

2005-10-14 14:56:10

by Mark Knecht

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot

On 10/13/05, Ingo Molnar <[email protected]> wrote:
>
> * Mark Knecht <[email protected]> wrote:
>
> > Ingo & Steve,
> > Thank you for your great instructions that even a guitar player
> > could basically follow. After about an hour of messing around I did
> > manage to capture the crash. The console file is attached.
> >
> > NOTE: The first time I booted the kernel it got to the crash point and
> > the machine rebooted. The second time it booted I got the trace. Both
> > boots are in the capture file.
>
> thanks, this log is much more informative. No smoking gun though, but it
> seems something fundamental (probably lowlevel x64 code) has been broken
> by -rt1.
>
> Do the crashes go away if you take the -rc3-rt13 version of
> arch/x86_64/kernel/entry.S and copy it over into the -rc4-rt1 tree?
> [this undoes a particular set of CONFIG_CRITICAL_IRQSOFF_TIMING fixes
> from the x64 code, which i did during -rc3-rt13 => -rc4-rt1]

Indeed it is fixed by doing this. Options are on but the modified
kernel does boot:

*****************************************************************************
* *
* REMINDER, the following debugging options are turned on in your .config: *
* *
* CONFIG_DEBUG_PREEMPT *
* CONFIG_CRITICAL_PREEMPT_TIMING *
* CONFIG_CRITICAL_IRQSOFF_TIMING *
* *
* they may increase runtime overhead and latencies. *
* *
*****************************************************************************

mark@lightning ~ $ uname -a
Linux lightning 2.6.14-rc4-rt1 #8 PREEMPT Fri Oct 14 07:46:29 PDT 2005
x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
mark@lightning ~ $


>
> (Note that doing this will re-introduce tracing bugs, which can result
> in false-positive latency readings - but it should fix any related
> lowlevel bug in the assembly code.)
>
> if this indeed solves the crash then i'd suggest to restore the -rt1
> version of entry.S, and i'd suggest to disable CRITICAL_IRQSOFF_TIMING
> until i fix it. You should be able to get pretty good latency tracing
> info even without CRITICAL_IRQSOFF_TIMING.
>
> Ingo
>

Will got back to the original entry.S file with the IRQoff option
turned off. Let me know when you have a fix to test, or if you need
more data.

Thanks,
Mark

2005-10-14 19:41:04

by Mark Knecht

[permalink] [raw]
Subject: Re: 2.6.14-rc4-rt1 - enable IRQ-off tracing causes kernel to fault at boot

On 10/14/05, Mark Knecht <[email protected]> wrote:
> On 10/13/05, Ingo Molnar <[email protected]> wrote:
> >
> > * Mark Knecht <[email protected]> wrote:
> >
> > > Ingo & Steve,
> > > Thank you for your great instructions that even a guitar player
> > > could basically follow. After about an hour of messing around I did
> > > manage to capture the crash. The console file is attached.
> > >
> > > NOTE: The first time I booted the kernel it got to the crash point and
> > > the machine rebooted. The second time it booted I got the trace. Both
> > > boots are in the capture file.
> >
> > thanks, this log is much more informative. No smoking gun though, but it
> > seems something fundamental (probably lowlevel x64 code) has been broken
> > by -rt1.
> >
> > Do the crashes go away if you take the -rc3-rt13 version of
> > arch/x86_64/kernel/entry.S and copy it over into the -rc4-rt1 tree?
> > [this undoes a particular set of CONFIG_CRITICAL_IRQSOFF_TIMING fixes
> > from the x64 code, which i did during -rc3-rt13 => -rc4-rt1]
>
> Indeed it is fixed by doing this. Options are on but the modified
> kernel does boot:

Ingo,
Good news and strange news:

GOOD NEWS: This problem may or may not be fixed in 2.6.14-rc4-rt5. The
kernel now boots for me with IRQoff tracing enabled, but once again
I'm seeing times that appear to be the timer frequency:

( softirq-timer/0-3 |#0): new 998 us maximum-latency critical section.
=> started at timestamp 403325647: <acpi_processor_idle+0x20/0x379>
=> ended at timestamp 403326645: <thread_return+0xb5/0x11a>

Call Trace:<ffffffff8014e58c>{check_critical_timing+492}
<ffffffff8014e9e5>{sub_preempt_count_ti+133}
<ffffffff803edaec>{thread_return+70}
<ffffffff803edb5b>{thread_return+181}
<ffffffff803edcc5>{schedule+261} <ffffffff8013723a>{ksoftirqd+138}
<ffffffff801371b0>{ksoftirqd+0} <ffffffff8014757d>{kthread+205}
<ffffffff8010e70e>{child_rip+8} <ffffffff801474b0>{kthread+0}
<ffffffff8010e706>{child_rip+0}
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff803ed5b8>] .... __schedule+0xb8/0x5a6
.....[<00000000>] .. ( <= _stext+0x7feff0e8/0xe8)

=> dump-end timestamp 403326728

lightning ~ #

STRANGE NEWS: It now takes between 1-2 minutes for Gnome to log out.

I think possibly something is not exactly right about the timers yet.

- Mark