2003-09-01 15:34:26

by Dale E Martin

[permalink] [raw]
Subject: repeatable, hard lockup on boot in linux-2.6.0-test4

(I'm not subscribed to linux-kernel currently, please Cc: me any replies.)
I recently upgraded from 2.5.75 (which I felt did not have good interactive
response in some circumstances compared to 2.4.21) to 2.6.0-test4. (In
all cases, Linus trees, unpatched.)

However, 2.6.0-test4 locks up hard just after loading the PS/2 mouse stuff.
Unfortunately none the boot messages make it to disk since I don't get far
enough in the boot process for syslog to turn on.

The only change I think could be significant over my 2.5.75 configuration
is that I enabled tagged command queuing for IDA (and I get a message that
it's enabled during boot - I have an IBM disk.)

Here's the output of lspci (after booting with 2.4.21):

00:00.0 Host bridge: Silicon Integrated Systems [SiS] 735 Host (rev 01)
00:01.0 PCI bridge: Silicon Integrated Systems [SiS] SiS 530 Virtual PCI-to-PCI bridge (AGP)
00:02.0 ISA bridge: Silicon Integrated Systems [SiS] 85C503/5513
00:02.2 USB Controller: Silicon Integrated Systems [SiS] SiS7001 USB Controller (rev 07)
00:02.3 USB Controller: Silicon Integrated Systems [SiS] SiS7001 USB Controller (rev 07)
00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE] (rev d0)
00:03.0 Ethernet controller: Silicon Integrated Systems [SiS] SiS900 10/100 Ethernet (rev 90)
00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB12LV23 IEEE-1394 Controller
00:0f.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV11DDR [GeForce2 MX 100 DDR/200 DDR] (rev b2)

Here's the output from "hdparm -i", in case this is a TCQ issue:
/dev/hda:
Model=IC35L080AVVA07-0, FwRev=VA4OA52A, SerialNo=VNC402A4L5XTJA
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=52
BuffType=DualPortCache, BuffSize=1863kB, MaxMultSect=16, MultSect=16
CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=160836480
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: 2 3 4 5

My .config file is attached.

What other info can I provide to help diagnose this issue?

Thanks!
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available


Attachments:
(No filename) (2.41 kB)
.config (25.74 kB)
Download all attachments

2003-09-01 18:24:05

by Dale E Martin

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

I just removed TCQ from my IDE setup, and I still lock up during boot. The
last message displayed is:
mice: PS/2 mouse device common for all mice

The numlock comes on, and then I'm locked up hard, for instance, I cannot
turn off the numlock at this point.

One thing that I would note is that I don't have anything plugged into my
PS2/2 port since I have a USB mouse. (A Kensington Model# MOSUU B, that
works fine in 2.4.x, FWIW.) Please advise if there is more debugging that
I can try.

Thanks,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available

2003-09-01 19:04:12

by Andrew Morton

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

"Dale E Martin" <[email protected]> wrote:
>
> I just removed TCQ from my IDE setup, and I still lock up during boot. The
> last message displayed is:
> mice: PS/2 mouse device common for all mice
>
> The numlock comes on, and then I'm locked up hard, for instance, I cannot
> turn off the numlock at this point.
>
> One thing that I would note is that I don't have anything plugged into my
> PS2/2 port since I have a USB mouse. (A Kensington Model# MOSUU B, that
> works fine in 2.4.x, FWIW.) Please advise if there is more debugging that
> I can try.

Are you able to plug any PS/2 devices into the machine, see if that makes a
difference?

Also it would be useful to add the option "initcall_debug" to the kernel
boot command line. Then you will see a bunch of messages of the form

calling initcall 0xNNNNNNNN

Please look up the final address in the System.map file from the 2.6 build.
This will give us an idea of what the machine was trying to do when it
died.



2003-09-01 19:57:09

by Dale E Martin

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

> Are you able to plug any PS/2 devices into the machine, see if that makes
> a difference?

Yes, plugging in a PS/2 mouse in addition to the USB mouse does allow the
machine to boot up. (Just tried it, thanks for the suggestion.)

> Also it would be useful to add the option "initcall_debug" to the kernel
> boot command line. Then you will see a bunch of messages of the form
>
> calling initcall 0xNNNNNNNN
>
> Please look up the final address in the System.map file from the 2.6 build.
> This will give us an idea of what the machine was trying to do when it
> died.

It was in "i8042_init". Thanks again for the help.

Take care,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available

2003-09-01 22:27:59

by Andrew Morton

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

"Dale E Martin" <[email protected]> wrote:
>
> > Are you able to plug any PS/2 devices into the machine, see if that makes
> > a difference?
>
> Yes, plugging in a PS/2 mouse in addition to the USB mouse does allow the
> machine to boot up. (Just tried it, thanks for the suggestion.)

OK, don't go away yet ;)

Could you please add this patch and see how far it gets? If you're keen,
keep adding more DB() statements, narrow it down further?


drivers/input/serio/i8042.c | 23 ++++++++++++++++++-----
include/linux/kernel.h | 1 +
2 files changed, 19 insertions(+), 5 deletions(-)

diff -puN drivers/input/serio/i8042.c~a drivers/input/serio/i8042.c
--- 25/drivers/input/serio/i8042.c~a 2003-09-01 15:24:02.000000000 -0700
+++ 25-akpm/drivers/input/serio/i8042.c 2003-09-01 15:25:31.000000000 -0700
@@ -831,36 +831,49 @@ int __init i8042_init(void)
{
int i;

+ DB();
dbg_init();

+ DB();
if (i8042_platform_init())
return -EBUSY;

i8042_aux_values.irq = I8042_AUX_IRQ;
i8042_kbd_values.irq = I8042_KBD_IRQ;

+ DB();
if (i8042_controller_init())
return -ENODEV;

+ DB();
if (i8042_dumbkbd)
i8042_kbd_port.write = NULL;

- for (i = 0; i < 4; i++)
+ for (i = 0; i < 4; i++) {
+ DB();
i8042_init_mux_values(i8042_mux_values + i, i8042_mux_port + i, i);
+ }

- if (!i8042_nomux && !i8042_check_mux(&i8042_aux_values))
- for (i = 0; i < 4; i++)
+ if (!i8042_nomux && !i8042_check_mux(&i8042_aux_values)) {
+ for (i = 0; i < 4; i++) {
+ DB();
i8042_port_register(i8042_mux_values + i, i8042_mux_port + i);
- else
- if (!i8042_noaux && !i8042_check_aux(&i8042_aux_values))
+ }
+ } else {
+ if (!i8042_noaux && !i8042_check_aux(&i8042_aux_values)) {
+ DB();
i8042_port_register(&i8042_aux_values, &i8042_aux_port);
+ }
+ }

+ DB();
i8042_port_register(&i8042_kbd_values, &i8042_kbd_port);

init_timer(&i8042_timer);
i8042_timer.function = i8042_timer_func;
mod_timer(&i8042_timer, jiffies + I8042_POLL_PERIOD);

+ DB();
register_reboot_notifier(&i8042_notifier);

return 0;
diff -puN include/linux/kernel.h~a include/linux/kernel.h
--- 25/include/linux/kernel.h~a 2003-09-01 15:24:02.000000000 -0700
+++ 25-akpm/include/linux/kernel.h 2003-09-01 15:25:55.000000000 -0700
@@ -236,4 +236,5 @@ extern void BUILD_BUG(void);
#define __FUNCTION__ (__func__)
#endif

+#define DB() printk("%s:%d\n", __FILE__, __LINE__)
#endif

_

2003-09-02 00:31:48

by Dale E Martin

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

> OK, don't go away yet ;)
>
> Could you please add this patch and see how far it gets? If you're keen,
> keep adding more DB() statements, narrow it down further?

It made it to line 853, and the loop ran four times. No other "DB" ever
printed anything, so I guess it locked up inside i8042_init_mux_values when
i = 3.

Thanks again for the help,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available

2003-09-02 01:23:25

by Andrew Morton

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

"Dale E Martin" <[email protected]> wrote:
>
> > OK, don't go away yet ;)
> >
> > Could you please add this patch and see how far it gets? If you're keen,
> > keep adding more DB() statements, narrow it down further?
>
> It made it to line 853, and the loop ran four times. No other "DB" ever
> printed anything, so I guess it locked up inside i8042_init_mux_values when
> i = 3.
>

I'd be more suspecting that it got through that loop and locked up in
i8042_check_mux() or i8042_check_aux().

Can you try this one please?


drivers/input/serio/i8042.c | 44 ++++++++++++++++++++++++++++++++++++++------
include/linux/kernel.h | 1 +
2 files changed, 39 insertions(+), 6 deletions(-)

diff -puN drivers/input/serio/i8042.c~i8042-debug drivers/input/serio/i8042.c
--- 25/drivers/input/serio/i8042.c~i8042-debug 2003-09-01 18:17:56.000000000 -0700
+++ 25-akpm/drivers/input/serio/i8042.c 2003-09-01 18:21:53.000000000 -0700
@@ -587,16 +587,19 @@ static int __init i8042_check_mux(struct
* Check if AUX irq is available.
*/

+ DB();
if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
"i8042", i8042_request_irq_cookie))
return -1;
+ DB();
free_irq(values->irq, i8042_request_irq_cookie);
-
+ DB();
/*
* Get rid of bytes in the queue.
*/

i8042_flush();
+ DB();

/*
* Internal loopback test - send three bytes, they should come back from the
@@ -607,13 +610,16 @@ static int __init i8042_check_mux(struct
param = 0xf0;
if (i8042_command(&param, I8042_CMD_AUX_LOOP) || param != 0x0f)
return -1;
+ DB();
param = 0x56;
if (i8042_command(&param, I8042_CMD_AUX_LOOP) || param != 0xa9)
return -1;
+ DB();
param = 0xa4;
if (i8042_command(&param, I8042_CMD_AUX_LOOP) || param == 0x5b)
return -1;

+ DB();
printk(KERN_INFO "i8042.c: Detected active multiplexing controller, rev %d.%d.\n",
(~param >> 4) & 0xf, ~param & 0xf);

@@ -630,6 +636,7 @@ static int __init i8042_check_mux(struct
i8042_ctr &= I8042_CTR_AUXDIS;
i8042_ctr &= ~I8042_CTR_AUXINT;

+ DB();
if (i8042_command(&i8042_ctr, I8042_CMD_CTL_WCTR))
return -1;

@@ -638,7 +645,9 @@ static int __init i8042_check_mux(struct
*/

for (i = 0; i < 4; i++) {
+ DB();
i8042_command(&param, I8042_CMD_MUX_PFX + i);
+ DB();
i8042_command(&param, I8042_CMD_AUX_ENABLE);
}

@@ -659,15 +668,18 @@ static int __init i8042_check_aux(struct
* in trying to detect AUX presence.
*/

+ DB();
if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
"i8042", i8042_request_irq_cookie))
return -1;
+ DB();
free_irq(values->irq, i8042_request_irq_cookie);

/*
* Get rid of bytes in the queue.
*/

+ DB();
i8042_flush();

/*
@@ -677,6 +689,7 @@ static int __init i8042_check_aux(struct
*/

param = 0x5a;
+ DB();
if (i8042_command(&param, I8042_CMD_AUX_LOOP) || param != 0xa5) {

/*
@@ -687,6 +700,7 @@ static int __init i8042_check_aux(struct
* AUX ports, we test for this only when the LOOP command failed.
*/

+ DB();
if (i8042_command(&param, I8042_CMD_AUX_TEST)
|| (param && param != 0xfa && param != 0xff))
return -1;
@@ -696,13 +710,17 @@ static int __init i8042_check_aux(struct
* Bit assignment test - filters out PS/2 i8042's in AT mode
*/

+ DB();
if (i8042_command(&param, I8042_CMD_AUX_DISABLE))
return -1;
+ DB();
if (i8042_command(&param, I8042_CMD_CTL_RCTR) || (~param & I8042_CTR_AUXDIS))
return -1;

+ DB();
if (i8042_command(&param, I8042_CMD_AUX_ENABLE))
return -1;
+ DB();
if (i8042_command(&param, I8042_CMD_CTL_RCTR) || (param & I8042_CTR_AUXDIS))
return -1;

@@ -713,6 +731,7 @@ static int __init i8042_check_aux(struct
i8042_ctr |= I8042_CTR_AUXDIS;
i8042_ctr &= ~I8042_CTR_AUXINT;

+ DB();
if (i8042_command(&i8042_ctr, I8042_CMD_CTL_WCTR))
return -1;

@@ -831,36 +850,49 @@ int __init i8042_init(void)
{
int i;

+ DB();
dbg_init();

+ DB();
if (i8042_platform_init())
return -EBUSY;

i8042_aux_values.irq = I8042_AUX_IRQ;
i8042_kbd_values.irq = I8042_KBD_IRQ;

+ DB();
if (i8042_controller_init())
return -ENODEV;

+ DB();
if (i8042_dumbkbd)
i8042_kbd_port.write = NULL;

- for (i = 0; i < 4; i++)
+ for (i = 0; i < 4; i++) {
+ DB();
i8042_init_mux_values(i8042_mux_values + i, i8042_mux_port + i, i);
+ }

- if (!i8042_nomux && !i8042_check_mux(&i8042_aux_values))
- for (i = 0; i < 4; i++)
+ if (!i8042_nomux && !i8042_check_mux(&i8042_aux_values)) {
+ for (i = 0; i < 4; i++) {
+ DB();
i8042_port_register(i8042_mux_values + i, i8042_mux_port + i);
- else
- if (!i8042_noaux && !i8042_check_aux(&i8042_aux_values))
+ }
+ } else {
+ if (!i8042_noaux && !i8042_check_aux(&i8042_aux_values)) {
+ DB();
i8042_port_register(&i8042_aux_values, &i8042_aux_port);
+ }
+ }

+ DB();
i8042_port_register(&i8042_kbd_values, &i8042_kbd_port);

init_timer(&i8042_timer);
i8042_timer.function = i8042_timer_func;
mod_timer(&i8042_timer, jiffies + I8042_POLL_PERIOD);

+ DB();
register_reboot_notifier(&i8042_notifier);

return 0;
diff -puN include/linux/kernel.h~i8042-debug include/linux/kernel.h
--- 25/include/linux/kernel.h~i8042-debug 2003-09-01 18:17:56.000000000 -0700
+++ 25-akpm/include/linux/kernel.h 2003-09-01 18:17:56.000000000 -0700
@@ -236,4 +236,5 @@ extern void BUILD_BUG(void);
#define __FUNCTION__ (__func__)
#endif

+#define DB() printk("%s:%d\n", __FILE__, __LINE__)
#endif

_

2003-09-02 12:31:13

by Dale E Martin

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

(I hadn't noticed the call to i8042_check_mux inside the if statement
condition after the loop, doh!)

> Can you try this one please?

Yes, this time it did this loop 4 times:
for (i = 0; i < 4; i++) {
DB();
i8042_init_mux_values(i8042_mux_values + i, i8042_mux_port + i, i);
}

And then I got one more at 590, inside i8042_check_mux:
DB();
if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
"i8042", i8042_request_irq_cookie))
return -1;
DB();

This last one did not show up, nor did any inside the conditional loops at
line 878 or 883, nor at line 888. So perhaps it was in request_irq that it
locked up?

I'll be away from this machine for the day, although I can try something
new this evening.

Take care,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available

2003-09-02 12:41:54

by Maciej Freudenheim

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4

Dale E Martin wrote:

> However, 2.6.0-test4 locks up hard just after loading the PS/2 mouse stuff.
> Unfortunately none the boot messages make it to disk since I don't get far
> enough in the boot process for syslog to turn on.

> CONFIG_ACPI=y

I had exactly the same problems. However, adding pci=noacpi to kernel
boot command line fixes the problem.

Maciej Freudenheim.

2003-09-02 20:20:28

by Andrew Morton

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

"Dale E Martin" <[email protected]> wrote:
>
> And then I got one more at 590, inside i8042_check_mux:
> DB();
> if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
> "i8042", i8042_request_irq_cookie))
> return -1;
> DB();
>
> This last one did not show up, nor did any inside the conditional loops at
> line 878 or 883, nor at line 888. So perhaps it was in request_irq that it
> locked up?

Looks like it. Please add a DB() to the start of i8042_interrupt(),
see if we locked up in an interrupt storm. Also sprinkle some in
request_irq() I guess. You know the deal ;)

There's an ugly in the irq code there: if i8042_check_mux() or
i8042_check_mux() are called while the device is open we end up freeing the
wrong IRQ. It is unlikely to help though.


--- 25/drivers/input/serio/i8042.c~i8042-free_irq-fix Tue Sep 2 12:55:23 2003
+++ 25-akpm/drivers/input/serio/i8042.c Tue Sep 2 12:57:19 2003
@@ -581,6 +581,7 @@ void i8042_controller_cleanup(void)
static int __init i8042_check_mux(struct i8042_values *values)
{
unsigned char param;
+ static int i8042_check_mux_cookie;
int i;

/*
@@ -588,9 +589,9 @@ static int __init i8042_check_mux(struct
*/

if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
- "i8042", i8042_request_irq_cookie))
+ "i8042", &i8042_check_mux_cookie))
return -1;
- free_irq(values->irq, i8042_request_irq_cookie);
+ free_irq(values->irq, &i8042_check_mux_cookie);

/*
* Get rid of bytes in the queue.
@@ -653,6 +654,7 @@ static int __init i8042_check_mux(struct
static int __init i8042_check_aux(struct i8042_values *values)
{
unsigned char param;
+ static int i8042_check_aux_cookie;

/*
* Check if AUX irq is available. If it isn't, then there is no point
@@ -660,9 +662,9 @@ static int __init i8042_check_aux(struct
*/

if (request_irq(values->irq, i8042_interrupt, SA_SHIRQ,
- "i8042", i8042_request_irq_cookie))
+ "i8042", &i8042_check_aux_cookie))
return -1;
- free_irq(values->irq, i8042_request_irq_cookie);
+ free_irq(values->irq, &i8042_check_aux_cookie);

/*
* Get rid of bytes in the queue.

_

2003-09-02 23:45:31

by Chris Wright

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

* Andrew Morton ([email protected]) wrote:
> Looks like it. Please add a DB() to the start of i8042_interrupt(),
> see if we locked up in an interrupt storm. Also sprinkle some in
> request_irq() I guess. You know the deal ;)

I have similar issue with:
floppy_init()
floppy_grab_irq_and_dma()
fd_request_irq()
request_irq()

floppy_hardint()
...
floppy_hardint()

This causes interrupt storm, hanging the machine on bootup. Booting with
pci=noacpi fixes this. So, I'm assuming acpi pci irq routing problem.
This is irq 6, which ACPI is disabling. Perhaps disabling leaves it in
a bogus state?

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2003-09-03 12:41:34

by Dale E Martin

[permalink] [raw]
Subject: Re: repeatable, hard lockup on boot in linux-2.6.0-test4 (more details)

> Looks like it. Please add a DB() to the start of i8042_interrupt(),
> see if we locked up in an interrupt storm.

I believe this is what's happening - i8042_interrupt is getting called at a
very fast rate and we never do anything else. (I suppose that's the
definition of an interrupt storm - this is my first foray into kernel
programming aside from fixing compiler errors.)

As others have suggested, booting with "pci=noacpi" fixes the problem.
(Leaving the debug statement in there, I'd estimate that i8042_interrupt
gets called at about 25 times a second - about a screenfull a second or
so.)

So, anyone else with this problem can plug in a PS/2 mouse or boot with
"pci=noacpi" to work around it.

The last thing I would mention is that 2.5.75 ran fine on this machine, so
I guess this problem has crept in since then.

> There's an ugly in the irq code there: if i8042_check_mux() or
> i8042_check_mux() are called while the device is open we end up freeing
> the wrong IRQ. It is unlikely to help though.

I did apply this patch, and it did not visibly help.

Thanks for the help, let me know if you'd like me to try anything else.

Take care,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
[email protected]
http://www.cliftonlabs.com
pgp key available