2001-12-01 19:21:52

by Ian Morgan

Subject: in-kernel pcmcia oopsing in SMP

On a few SMP boxes here, with kernels from about 2.4.14 through 2.4.17-pre2,
the system locks up hard during heavy wireless I/O.

Using in-kernel yenta_socket and orinoco drivers. Everything works 100%
with a UP kernel, or an SMP kernel with maxcpus=1, but with 2 cpus, the
system will lock up hard after just a few minutes of I/O.

(Some have said the kernel pcmcia stuff is still immature, but Hinds'
pcmcia-cs package doesn't work at all for me. Its ds.o keeps oopsing when
insmod'ed, so I can't even try it.)

With nmi_watchdog=2, I'm able to get an oops:

ksymoops 2.4.3 on i686 2.4.16-pre1-smp. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.16-pre1-smp/ (default)
-m /usr/src/linux/System.map (default)

Oops: 0000
EIP: 0010:[<c011ea18>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00012800 ebx: 4c15ff51 ecx: c03052d4 edx: ffffffff
esi: 00100000 edi: 4c15ff51 ebp: c02dd060 esp: c179bf28
ds: 0018 es: 0018 ss: 0018
Process (pid: -2147450880, stackpage=c179b000)
Stack: c02dd460 00000001 fffffffe 00100000 c011e7df c02dd460 c02d9800 c02d9810
c02d9810 c179bf74 00000046 c0108efd c0105290 c179a000 c0105290 00000000
00100000 c028df48 80008000 00000000 c023a990 c0105290 c179a000 c179a000
Call Trace: [<c011e7df>] [<c0108efd>] [<c0105290>] [<c0105290>] [<c0105290>]
[<c0105290>] [<c01052bc>] [<c0105322>] [<c0119c28>]
Code: 8b 3f f0 0f ba 6b 04 01 19 c0 85 c0 75 44 8b 43 08 85 c0 75

>>EIP; c011ea18 <tasklet_hi_action+2c/a0> <=====
Trace; c011e7de <do_softirq+6e/cc>
Trace; c0108efc <do_IRQ+1b8/1c8>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c01052bc <default_idle+2c/34>
Trace; c0105322 <cpu_idle+3e/54>
Trace; c0119c28 <release_console_sem+148/150>
Code; c011ea18 <tasklet_hi_action+2c/a0>
00000000 <_EIP>:
Code; c011ea18 <tasklet_hi_action+2c/a0> <=====
0: 8b 3f mov (%edi),%edi <=====
Code; c011ea1a <tasklet_hi_action+2e/a0>
2: f0 0f ba 6b 04 01 lock btsl $0x1,0x4(%ebx)
Code; c011ea20 <tasklet_hi_action+34/a0>
8: 19 c0 sbb %eax,%eax
Code; c011ea22 <tasklet_hi_action+36/a0>
a: 85 c0 test %eax,%eax
Code; c011ea24 <tasklet_hi_action+38/a0>
c: 75 44 jne 52 <_EIP+0x52> c011ea6a <tasklet_hi_action+7e/a0>
Code; c011ea26 <tasklet_hi_action+3a/a0>
e: 8b 43 08 mov 0x8(%ebx),%eax
Code; c011ea28 <tasklet_hi_action+3c/a0>
11: 85 c0 test %eax,%eax
Code; c011ea2a <tasklet_hi_action+3e/a0>
13: 75 00 jne 15 <_EIP+0x15> c011ea2c <tasklet_hi_action+40/a0>


Regards,
Ian Morgan
--
-------------------------------------------------------------------
Ian E. Morgan Vice President & C.O.O. Webcon, Inc.
[email protected] PGP: #2DA40D07 http://www.webcon.net
-------------------------------------------------------------------


2001-12-01 20:06:10

by David Hinds

Subject: Re: in-kernel pcmcia oopsing in SMP

On Sat, Dec 01, 2001 at 02:21:33PM -0500, Ian Morgan wrote:
> On a few SMP boxes here, with kernels from about 2.4.14 through 2.4.17-pre2,
> the system locks up hard during heavy wireless I/O.

The bug pretty much has to be in the orinoco driver; I'm not sure how
much stress testing it has had on SMP.

> (Some have said the kernel pcmcia stuff is still immature, but Hinds'
> pcmcia-cs package doesn't work at all for me. Its ds.o keeps oopsing when
> insmod'ed, so I can't even try it.)

I did fix one major SMP bug in the pcmcia-cs drivers just a couple
days ago; the beta at http://pcmcia-cs.sourceforge.net/ftp/NEW has the
fix. I'm not sure if it is really the same bug you describe, though,
since no one else has reported the ds module causing an immediate
oops.

The standalone drivers are unlikely to help, though, because the
orinoco_cs driver in the standalone package is virtually identical to
the one in the current 2.4.* kernel.

Actually, though, you could try the (older) wvlan_cs driver in the
pcmcia-cs package. You can do that with your current kernel drivers,
even. Unpack the pcmcia-cs package, do "make config", then cd to the
wireless subdirectory and do a "make" there. That should build a
wvlan_cs module that will mesh with your kernel PCMCIA subsystem. It
will at least give you another data point.

I don't know how to interpret your oops report; you should probably
also forward the bug to David Gibson, [email protected],
since he is the orinoco maintainer.

-- Dave

2001-12-01 20:27:48

by Ian Morgan

Subject: Re: in-kernel pcmcia oopsing in SMP

On Sat, 1 Dec 2001, David Hinds wrote:

> I did fix one major SMP bug in the pcmcia-cs drivers just a couple
> days ago; the beta at http://pcmcia-cs.sourceforge.net/ftp/NEW has the
> fix. I'm not sure if it is really the same bug you describe, though,
> since no one else has reported the ds module causing an immediate
> oops.

Hmm... I'll look into that. If it keeps oopsing, I'll send you the dump.

> The standalone drivers are unlikely to help, though, because the
> orinoco_cs driver in the standalone package is virtually identical to
> the one in the current 2.4.* kernel.

True. Actually, they're older. 0.08b seems current.

> Actually, though, you could try the (older) wvlan_cs driver in the
> pcmcia-cs package. You can do that with your current kernel drivers,
> even. Unpack the pcmcia-cs package, do "make config", then cd to the
> wireless subdirectory and do a "make" there. That should build a
> wvlan_cs module that will mesh with your kernel PCMCIA subsystem. It
> will at least give you another data point.

I've tried the wvlan_cs driver. It works for a few minutes, then gets all
messed up and needs to be reset. When it's hosed, iwconfig just dumps out a
lot of garbage, making me think the driver is writing all over itself.
Haven't tried this driver in UP mode.

I've also tried Lucent's binary driver, but it doesn't work at all. It will
allow a single packet to be sent, then shuts down the transceiver and changes
to channel 0 (?!), then needs to be reset before another single packet can be
sent, then shuts down the transceiver again, etc. (This happens on
several UP and SMP machines.)

> I don't know how to interpret your oops report; you should probably
> also forward the bug to David Gibson, [email protected],
> since he is the orinoco maintainer.

Well, Gibson's the one who suggested the problem was with the pcmcia system,
and not the orinoco driver! Hmm.... can you say runaround?

Basically, before about 2.4.14, the orinoco driver would go haywire and dump
out lots of errors (Gibson is familiar with them) and need to be manually
reset. On more recent kernels however, instead of the driver crapping out
with errors, the system just hard locks.

Regards,
Ian Morgan
--
-------------------------------------------------------------------
Ian E. Morgan Vice President & C.O.O. Webcon, Inc.
[email protected] PGP: #2DA40D07 http://www.webcon.net
-------------------------------------------------------------------


2001-12-01 20:47:01

by David Hinds

Subject: Re: in-kernel pcmcia oopsing in SMP

On Sat, Dec 01, 2001 at 03:27:24PM -0500, Ian Morgan wrote:
>
> > I don't know how to interpret your oops report; you should probably
> > also forward the bug to David Gibson, [email protected],
> > since he is the orinoco maintainer.
>
> Well, Gibson's the one who suggested the problem was with the pcmcia system,
> and not the orinoco driver! Hmm.... can you say runaround?

It pretty much can't be a PCMCIA subsystem bug. The basic PCMCIA code
handles card identification and configuration of the socket; however,
for almost all cards, the PCMCIA subsystem is completely out of the
loop during normal card operation. No PCMCIA code outside of the
orinoco driver itself will ever be executed.

Your oops, in tasklet code, sounds to me like a locking bug in the
driver code for managing the transmit stack vs. interrupt handling.
Have there been reports of the driver working well on SMP boxes?
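
For readers following the oops: the faulting instruction is
"mov (%edi),%edi" with edi = 4c15ff51, and the next instructions touch
0x4(%ebx) and 0x8(%ebx) with ebx holding the same value. That pattern is
what a walk over a singly linked list of tasklets would look like if the
list held a garbage pointer. Below is a minimal userspace sketch in C of
such a walk. It is an illustration, not the kernel's code: every demo_*
name is made up, the field layout is an assumption chosen to match the
offsets in the disassembly, and the requeue-on-busy handling a real
implementation would need is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tasklet; layout assumed, not copied. */
struct demo_tasklet {
    struct demo_tasklet *next;   /* offset 0: read by the faulting load   */
    atomic_uint state;           /* bit 1 set while the handler runs      */
    atomic_int count;            /* nonzero means "disabled, do not run"  */
    void (*func)(uintptr_t);     /* handler                               */
    uintptr_t data;              /* argument passed to the handler        */
};

#define DEMO_STATE_RUN (1u << 1)

/* Atomically set the RUN bit; succeed only if it was clear before.
 * This plays the role of the "lock btsl $0x1,0x4(%ebx)" in the dump. */
int demo_trylock(struct demo_tasklet *t)
{
    unsigned int old = atomic_fetch_or(&t->state, DEMO_STATE_RUN);
    return !(old & DEMO_STATE_RUN);
}

void demo_unlock(struct demo_tasklet *t)
{
    atomic_fetch_and(&t->state, ~DEMO_STATE_RUN);
}

/* Example handler: bumps the int that data points to. */
void demo_count_cb(uintptr_t data)
{
    int *hits = (int *)data;
    (*hits)++;
}

/* Walk the list and run every enabled, idle entry. Returns how many ran. */
int demo_run_list(struct demo_tasklet *list)
{
    int ran = 0;

    while (list != NULL) {
        struct demo_tasklet *t = list;

        /* Counterpart of "mov (%edi),%edi": if the list pointer is
         * garbage (such as 4c15ff51), this load is where it blows up. */
        list = list->next;

        if (demo_trylock(t)) {
            if (atomic_load(&t->count) == 0) {
                t->func(t->data);
                ran++;
            }
            demo_unlock(t);
        }
    }
    return ran;
}
```

If this reading is right, the walk itself is fine and the interesting
question is who put a bad pointer on the list, for example a tasklet
structure that was freed, reinitialised, or overwritten while still
queued.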

-- Dave

2001-12-03 08:51:53

by Ian Morgan

Subject: BUG() in spinlock.h loading ds.o

On Sat, 1 Dec 2001, David Hinds wrote:

> On Sat, Dec 01, 2001 at 02:21:33PM -0500, Ian Morgan wrote:
>
> > (Some have said the kernel pcmcia stuff is still immature, but Hinds'
> > pcmcia-cs package doesn't work at all for me. Its ds.o keeps oopsing when
> > insmod'ed, so I can't even try it.)
>
> I did fix one major SMP bug in the pcmcia-cs drivers just a couple
> days ago; the beta at http://pcmcia-cs.sourceforge.net/ftp/NEW has the
> fix. I'm not sure if it is really the same bug you describe, though,
> since no one else has reported the ds module causing an immediate
> oops.

Well, I've tried the new 30-Nov-01 package, but ds.o still keeps causing
oopses consistently, whether in UP or SMP. I've also turned on kernel BUG()
reporting, which seems to indicate a problem in spinlock.h. Here are a
couple sample oopses:

ksymoops 2.4.3 on i686 2.4.17-pre2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.17-pre2/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

eip: c0114c40
kernel BUG at /usr/src/linux-2.4.17-pre2/include/asm/spinlock.h:133!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c0114c6d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010086
eax: 0000004b ebx: 00000001 ecx: c028d1ac edx: 0000300a
esi: 00000004 edi: d5ae6260 ebp: c029fea0 esp: c029fe6c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c029f000)
Stack: c0241060 00000085 00000000 00000004 d5ae6260 00000008 d7a212f0 00000002
00000001 d7ff1f24 00000286 00000001 d5ae6274 c029feb4 d8bae3a6 d531dac0
00000004 00000000 c029fec4 d8bae4a8 d5ae6260 00000004 c029fee4 d8b99766
Call Trace: [<d8bae3a6>] [<d8bae4a8>] [<d8b99766>] [<d8b996b2>] [<d8b99624>]
[<c0121bbb>] [<c0121ca0>] [<c011dee2>] [<c011ddb9>] [<c011db3f>] [<c0108efd>]
[<c0105290>] [<c0105290>] [<c0105290>] [<c0105290>] [<c01052bc>] [<c0105322>]
[<c0105000>] [<c010509a>]
Code: 0f 0b 83 c4 08 8b 55 fc f0 fe 0a 0f 88 04 c2 11 00 89 5d f0

>>EIP; c0114c6c <__wake_up+50/1c8> <=====
Trace; d8bae3a6 <.bss.end+3cd4/????>
Trace; d8bae4a8 <.bss.end+3dd6/????>
Trace; d8b99766 <[pcmcia_core].bss.end+f4e8/1ad82>
Trace; d8b996b2 <[pcmcia_core].bss.end+f434/1ad82>
Trace; d8b99624 <[pcmcia_core].bss.end+f3a6/1ad82>
Trace; c0121bba <timer_bh+2fe/3c4>
Trace; c0121ca0 <do_timer+20/50>
Trace; c011dee2 <bh_action+4e/108>
Trace; c011ddb8 <tasklet_hi_action+6c/a0>
Trace; c011db3e <do_softirq+6e/cc>
Trace; c0108efc <do_IRQ+1b8/1c8>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c01052bc <default_idle+2c/34>
Trace; c0105322 <cpu_idle+3e/54>
Trace; c0105000 <_stext+0/0>
Trace; c010509a <rest_init+9a/9c>
Code; c0114c6c <__wake_up+50/1c8>
00000000 <_EIP>:
Code; c0114c6c <__wake_up+50/1c8> <=====
0: 0f 0b ud2a <=====
Code; c0114c6e <__wake_up+52/1c8>
2: 83 c4 08 add $0x8,%esp
Code; c0114c70 <__wake_up+54/1c8>
5: 8b 55 fc mov 0xfffffffc(%ebp),%edx
Code; c0114c74 <__wake_up+58/1c8>
8: f0 fe 0a lock decb (%edx)
Code; c0114c76 <__wake_up+5a/1c8>
b: 0f 88 04 c2 11 00 js 11c215 <_EIP+0x11c215> c0230e80 <stext_lock+5f4/68b2>
Code; c0114c7c <__wake_up+60/1c8>
11: 89 5d f0 mov %ebx,0xfffffff0(%ebp)

<0>Kernel Panic: Aiee, killing interrupt handler!

1 warning issued. Results may not be reliable.


ksymoops 2.4.3 on i686 2.4.17-pre2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.17-pre2/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

eip: c0114c40
kernel BUG at /usr/src/linux-2.4.17-pre2/include/asm/spinlock.h:133!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c0114c6d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010082
eax: 0000004b ebx: 00000001 ecx: c028d1ac edx: 00003064
esi: 00000004 edi: d5f16220 ebp: c179be8c esp: c179be58
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c179b000)
Stack: c0241060 00000085 00000000 00000004 d5f16260 c1788000 c1789fe8 c02db800
c01217d4 c179a000 00000282 00000001 d5f16234 c179bea0 d8bae3a6 d6339160
00000004 00000000 c179beb0 d8bae4a8 d5f16220 00000004 c179bed0 d8b80766
Call Trace: [<c01217d4>] [<d8bae3a6>] [<d8bae4a8>] [<d8b80766>] [<d8b806b2>]
[<d8b80624>] [<c0121bbb>] [<c0121ca0>] [<c011dee2>] [<c011ddb9>] [<c011db3f>]
[<c0108efd>] [<c0105290>] [<c0105290>] [<c0105290>] [<c0105290>] [<c01052bc>]
[<c0105322>] [<c0118f8a>]
Code: 0f 0b 83 c4 08 8b 55 fc f0 fe 0a 0f 88 04 c2 11 00 89 5d f0

>>EIP; c0114c6c <__wake_up+50/1c8> <=====
Trace; c01217d4 <update_process_times+20/94>
Trace; d8bae3a6 <.bss.end+3cd4/????>
Trace; d8bae4a8 <.bss.end+3dd6/????>
Trace; d8b80766 <[pcmcia_core]send_event+32/4c>
Trace; d8b806b2 <[pcmcia_core]unreset_socket+8e/110>
Trace; d8b80624 <[pcmcia_core]unreset_socket+0/110>
Trace; c0121bba <timer_bh+2fe/3c4>
Trace; c0121ca0 <do_timer+20/50>
Trace; c011dee2 <bh_action+4e/108>
Trace; c011ddb8 <tasklet_hi_action+6c/a0>
Trace; c011db3e <do_softirq+6e/cc>
Trace; c0108efc <do_IRQ+1b8/1c8>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c0105290 <default_idle+0/34>
Trace; c01052bc <default_idle+2c/34>
Trace; c0105322 <cpu_idle+3e/54>
Trace; c0118f8a <release_console_sem+14a/150>
Code; c0114c6c <__wake_up+50/1c8>
00000000 <_EIP>:
Code; c0114c6c <__wake_up+50/1c8> <=====
0: 0f 0b ud2a <=====
Code; c0114c6e <__wake_up+52/1c8>
2: 83 c4 08 add $0x8,%esp
Code; c0114c70 <__wake_up+54/1c8>
5: 8b 55 fc mov 0xfffffffc(%ebp),%edx
Code; c0114c74 <__wake_up+58/1c8>
8: f0 fe 0a lock decb (%edx)
Code; c0114c76 <__wake_up+5a/1c8>
b: 0f 88 04 c2 11 00 js 11c215 <_EIP+0x11c215> c0230e80 <stext_lock+5f4/68b2>
Code; c0114c7c <__wake_up+60/1c8>
11: 89 5d f0 mov %ebx,0xfffffff0(%ebp)

<0>Kernel Panic: Aiee, killing interrupt handler!

1 warning issued. Results may not be reliable.


Regards,
Ian Morgan
--
-------------------------------------------------------------------
Ian E. Morgan Vice President & C.O.O. Webcon, Inc.
[email protected] PGP: #2DA40D07 http://www.webcon.net
-------------------------------------------------------------------

2001-12-03 08:51:49

by David Hinds

Subject: Re: BUG() in spinlock.h loading ds.o

On Sun, Dec 02, 2001 at 10:25:56PM -0500, Ian Morgan wrote:
>
> Well, I've tried the new 30-Nov-01 package, but ds.o still keeps causing
> oopses consistently, whether in UP or SMP. I've also turned on kernel BUG()
> reporting, which seems to indicate a problem in spinlock.h. Here are a
> couple sample oopses:

Oh. Hmmm. The problem is that the PCMCIA package doesn't know about
the spinlock debugging option, so it mis-sized the spinlock data
structure.
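
A minimal userspace sketch in C of this kind of mismatch follows. It is
not the kernel's or the pcmcia-cs package's code. It assumes the debug
option adds a "magic" field to the lock and that the locking side checks
that field before use, which would be consistent with the BUG() at
spinlock.h:133 firing inside __wake_up. All demo_* names, the
nr_waiters field, and the DEMO_LOCK_MAGIC value are invented for
illustration.

```c
#include <assert.h>
#include <string.h>

#define DEMO_LOCK_MAGIC 0xdead4eadu   /* illustrative value only */

/* Lock as seen by code built WITHOUT the debug option. */
struct demo_lock_plain { unsigned int lock; };

/* Lock as seen by code built WITH the debug option. */
struct demo_lock_debug { unsigned int lock; unsigned int magic; };

/* A wait-queue-like object that embeds the lock, in both views. */
struct demo_waitq_plain { struct demo_lock_plain lock; unsigned int nr_waiters; };
struct demo_waitq_debug { struct demo_lock_debug lock; unsigned int nr_waiters; };

/* "Module" side: initialises the object using the plain layout.
 * mem must hold at least sizeof(struct demo_waitq_debug) bytes. */
void demo_init_plain(unsigned char *mem)
{
    struct demo_waitq_plain q = { { 1u }, 0u };
    assert(mem != NULL);
    memcpy(mem, &q, sizeof q);
}

/* Correctly configured side: initialises using the debug layout. */
void demo_init_debug(unsigned char *mem)
{
    struct demo_waitq_debug q = { { 1u, DEMO_LOCK_MAGIC }, 0u };
    assert(mem != NULL);
    memcpy(mem, &q, sizeof q);
}

/* "Kernel" side: interprets the bytes with the debug layout and checks
 * the magic before it would take the lock. Returns 1 if the check passes. */
int demo_magic_ok(const unsigned char *mem)
{
    struct demo_waitq_debug q;
    assert(mem != NULL);
    memcpy(&q, mem, sizeof q);
    return q.lock.magic == DEMO_LOCK_MAGIC;
}
```

With a zeroed buffer of sizeof(struct demo_waitq_debug) bytes,
demo_init_plain() followed by demo_magic_ok() fails the check, because
the slot where the debug build expects the magic holds nr_waiters
instead. The same buffer passes after demo_init_debug(). The two views
also disagree on sizeof and on the offset of every field after the lock.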

I can modify the PCMCIA Configure script to process this option. Of
course this doesn't address your main problem with the orinoco driver.

-- Dave

2001-12-17 03:40:23

by David Gibson

Subject: Re: in-kernel pcmcia oopsing in SMP

On Sat, Dec 01, 2001 at 12:46:30PM -0800, David Hinds wrote:
> On Sat, Dec 01, 2001 at 03:27:24PM -0500, Ian Morgan wrote:
> >
> > > I don't know how to interpret your oops report; you should probably
> > > also forward the bug to David Gibson, [email protected],
> > > since he is the orinoco maintainer.
> >
> > Well, Gibson's the one who suggested the problem was with the pcmcia system,
> > and not the orinoco driver! Hmm.... can you say runaround?

Look, I'm not paid to do tech support for you, so there is nothing for
me to gain in trying to give you the runaround. The orinoco driver is
designed to make hard hangs very unlikely, even at the expense of a
greater chance of the driver operation falling over, so that was my
best initial guess at the problem - albeit possibly a hurried and
inaccurate one (see below).

> It pretty much can't be a PCMCIA subsystem bug. The basic PCMCIA code
> handles card identification and configuration of the socket; however,
> for almost all cards, the PCMCIA subsystem is completely out of the
> loop during normal card operation. No PCMCIA code outside of the
> orinoco driver itself will ever be executed.

Hmm... yes, I suppose so. How odd.

> Your oops, in tasklet code, sounds to me like a locking bug in the
> driver code for managing the transmit stack vs. interrupt handling.
> Have there been reports of the driver working well on SMP boxes?

Well, one of the main features of the driver is that the Tx path and
the interrupt handler (Rx path) are permitted to run concurrently.
This is an issue even on UP (although not as complex), since the Rx
path can interrupt the Tx path. I believe there has been at least
some successful operation on SMP machines, but unfortunately I don't
know any details.

--
David Gibson | For every complex problem there is a
[email protected] | solution which is simple, neat and
| wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson

2001-12-17 05:00:25

by David Hinds

Subject: Re: in-kernel pcmcia oopsing in SMP

On Mon, Dec 17, 2001 at 02:24:00PM +1100, David Gibson wrote:
>
> > Your oops, in tasklet code, sounds to me like a locking bug in the
> > driver code for managing the transmit stack vs. interrupt handling.
> > Have there been reports of the driver working well on SMP boxes?
>
> Well, one of the main features of the driver is that the Tx path and
> the interrupt handler (Rx path) are permitted to run concurrently.
> This is an issue even on UP (although not as complex), since the Rx
> path can interrupt the Tx path. I believe there has been at least
> some successful operation on SMP machines, but unfortunately I don't
> know any details.

Yes, after I wrote that, I looked at the orinoco code, and the tx and
interrupt paths looked pretty straightforward. But I don't think I
would be able to catch anything that wasn't really obvious.

-- Dave