2007-05-03 20:31:46

by Lennart Sorensen

[permalink] [raw]
Subject: Strange soft lockup detected message

I have had this happen a few times recently and was wondering if anyone
has an idea what could be going on:

BUG: soft lockup detected on CPU#0!
[<c0103fc4>] dump_stack+0x24/0x30
[<c013d71e>] softlockup_tick+0x7e/0xc0
[<c011eb23>] update_process_times+0x33/0x80
[<c01062c9>] timer_interrupt+0x39/0x80
[<c013daad>] handle_IRQ_event+0x3d/0x70
[<c013de09>] __do_IRQ+0xa9/0x150
[<c0104e55>] do_IRQ+0x25/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<d084e00c>] pcnet32_dwio_read_csr+0xc/0x20 [pcnet32]
[<d084e9d2>] pcnet32_interrupt+0x42/0x2b0 [pcnet32]
[<c013daad>] handle_IRQ_event+0x3d/0x70
[<c013de09>] __do_IRQ+0xa9/0x150
[<c0104e55>] do_IRQ+0x25/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<c013da88>] handle_IRQ_event+0x18/0x70
[<c013de09>] __do_IRQ+0xa9/0x150
[<c0104e55>] do_IRQ+0x25/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<00005791>] 0x5791

This is on a system running a Geode LX at 500MHz, using 2.6.18 based
kernel (specifically a slightly modified debian 4.0 Etch kernel).

I am really wondering where do I go looking for the cause of this. The
same kernel running on a Geode SC1200 (GX1) does not appear to do this.

If I knew what the error meant I would have a better idea how to debug
it and fix it.

--
Len Sorensen


2007-05-04 14:10:27

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Thu, May 03, 2007 at 04:31:43PM -0400, Lennart Sorensen wrote:
> I have had this happen a few times recently and was wondering if anyone
> has an idea what could be going on:
>
> BUG: soft lockup detected on CPU#0!
> [<c0103fc4>] dump_stack+0x24/0x30
> [<c013d71e>] softlockup_tick+0x7e/0xc0
> [<c011eb23>] update_process_times+0x33/0x80
> [<c01062c9>] timer_interrupt+0x39/0x80
> [<c013daad>] handle_IRQ_event+0x3d/0x70
> [<c013de09>] __do_IRQ+0xa9/0x150
> [<c0104e55>] do_IRQ+0x25/0x60
> [<c010313a>] common_interrupt+0x1a/0x20
> [<d084e00c>] pcnet32_dwio_read_csr+0xc/0x20 [pcnet32]
> [<d084e9d2>] pcnet32_interrupt+0x42/0x2b0 [pcnet32]
> [<c013daad>] handle_IRQ_event+0x3d/0x70
> [<c013de09>] __do_IRQ+0xa9/0x150
> [<c0104e55>] do_IRQ+0x25/0x60
> [<c010313a>] common_interrupt+0x1a/0x20
> [<c013da88>] handle_IRQ_event+0x18/0x70
> [<c013de09>] __do_IRQ+0xa9/0x150
> [<c0104e55>] do_IRQ+0x25/0x60
> [<c010313a>] common_interrupt+0x1a/0x20
> [<00005791>] 0x5791
>
> This is on a system running a Geode LX at 500MHz, using 2.6.18 based
> kernel (specifically a slightly modified debian 4.0 Etch kernel).
>
> I am really wondering where do I go looking for the cause of this. The
> same kernel running on a Geode SC1200 (GX1) does not appear to do this.
>
> If I knew what the error meant I would have a better idea how to debug
> it and fix it.

I looked at the pcnet32_interrupt function and where it calls
pcnet32_dwio_read_csr and saw this:

2550 /* The PCNET32 interrupt handler. */
2551 static irqreturn_t
2552 pcnet32_interrupt(int irq, void *dev_id)
2553 {
2554 struct net_device *dev = dev_id;
2555 struct pcnet32_private *lp;
2556 unsigned long ioaddr;
2557 u16 csr0;
2558 int boguscnt = max_interrupt_work;
2559
2560 ioaddr = dev->base_addr;
2561 lp = netdev_priv(dev);
2562
2563 spin_lock(&lp->lock);
2564
2565 csr0 = lp->a.read_csr(ioaddr, CSR0);
2566 while ((csr0 & 0x8f00) && --boguscnt >= 0) {
2567 if (csr0 == 0xffff) {
2568 break; /* PCMCIA remove happened */

So I wonder, what happens if an interrupt occours, and since one of the
devices on that interrupt is the pcnet32 so it grabs the port lock, goes
to read CSR0, and then another interrupt occours on the same IRQ line
(I run with PREEMPT enabled if that matters) and the pcnet32 interrupt
handler is called again but since the port is already locked it has to
wait, causing the cpu to be locked up.

Should line 2563 be a spin_lock_irqsave instead along with the
appropriate unluck later?

--
Len Sorensen

2007-05-04 14:34:28

by Frederik Deweerdt

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 10:10:24AM -0400, Lennart Sorensen wrote:
> On Thu, May 03, 2007 at 04:31:43PM -0400, Lennart Sorensen wrote:
[...]
> Should line 2563 be a spin_lock_irqsave instead along with the
> appropriate unluck later?
IIRC, when you enable lockdep, it will complain about spinlocks used in
an invalid context.

Regards,
Frederik
>
> --
> Len Sorensen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2007-05-04 15:19:36

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 04:33:26PM +0200, Frederik Deweerdt wrote:
> On Fri, May 04, 2007 at 10:10:24AM -0400, Lennart Sorensen wrote:
> > On Thu, May 03, 2007 at 04:31:43PM -0400, Lennart Sorensen wrote:
> [...]
> > Should line 2563 be a spin_lock_irqsave instead along with the
> > appropriate unluck later?
> IIRC, when you enable lockdep, it will complain about spinlocks used in
> an invalid context.

What is lockdep and how do I enable it?

I enabled SPINLOCK_DEBUG and am going to try that kernel now (except it
hit the bug before I could even log in and install the kernel this time,
so another reboot first).

--
Len Sorensen

2007-05-04 15:35:43

by Frederik Deweerdt

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 11:19:34AM -0400, Lennart Sorensen wrote:
> On Fri, May 04, 2007 at 04:33:26PM +0200, Frederik Deweerdt wrote:
> > On Fri, May 04, 2007 at 10:10:24AM -0400, Lennart Sorensen wrote:
> > > On Thu, May 03, 2007 at 04:31:43PM -0400, Lennart Sorensen wrote:
> > [...]
> > > Should line 2563 be a spin_lock_irqsave instead along with the
> > > appropriate unluck later?
> > IIRC, when you enable lockdep, it will complain about spinlocks used in
> > an invalid context.
>
> What is lockdep and how do I enable it?
For the "what" part, see Documentation/lockdep-design.txt. You'll enable
it by with SPINLOCK_DEBUG, indeed.
>
> I enabled SPINLOCK_DEBUG and am going to try that kernel now (except it
> hit the bug before I could even log in and install the kernel this time,
> so another reboot first).
>
> --
> Len Sorensen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2007-05-04 15:40:12

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 05:34:38PM +0200, Frederik Deweerdt wrote:
> For the "what" part, see Documentation/lockdep-design.txt. You'll enable
> it by with SPINLOCK_DEBUG, indeed.

Well I hope to see it hit the BUG again soon then to see what it has to
say.

--
Len Sorensen

2007-05-04 17:45:07

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 11:40:09AM -0400, Lennart Sorensen wrote:
> On Fri, May 04, 2007 at 05:34:38PM +0200, Frederik Deweerdt wrote:
> > For the "what" part, see Documentation/lockdep-design.txt. You'll enable
> > it by with SPINLOCK_DEBUG, indeed.
>
> Well I hope to see it hit the BUG again soon then to see what it has to
> say.

Well I didn't see anything for a while with SPINLOCK_DEBUG enabled
(maybe I didn't wait long enough). So I tried changing it to
spin_lock_irqsave, and that didn't go well. I got this as the result
now:

onfiguring network interfaces...eth1: link up, 100Mbps, full-duplex
BUG: spinlock recursion on CPU#0, ifconfig/962
lock: cf7a3304, .magic: dead4ead, .owner: ifconfig/962, .owner_cpu: 0
[<c0104024>] dump_stack+0x24/0x30
[<c01e3947>] _raw_spin_lock+0x137/0x140
[<c02981ec>] _spin_lock_irqsave+0x1c/0x30
[<d084eb86>] pcnet32_interrupt+0x216/0x290 [pcnet32]
[<c013b95d>] handle_IRQ_event+0x3d/0x70
[<c013ba2c>] __do_IRQ+0x9c/0x120
[<c0105025>] do_IRQ+0x25/0x60
[<c010316a>] common_interrupt+0x1a/0x20
[<c011927a>] __do_softirq+0x3a/0xa0
[<c011930d>] do_softirq+0x2d/0x30
[<c0119557>] irq_exit+0x37/0x40
[<c010502a>] do_IRQ+0x2a/0x60
[<c010316a>] common_interrupt+0x1a/0x20
[<c02983c0>] _spin_unlock_irqrestore+0x10/0x40
[<d08517ea>] pcnet32_open+0x27a/0x390 [pcnet32]
[<c02343e9>] dev_open+0x39/0x80
[<c0232b5a>] dev_change_flags+0xfa/0x130
[<c0277b7f>] devinet_ioctl+0x4ff/0x6f0
[<c0227b24>] sock_ioctl+0xf4/0x1f0
[<c017027c>] do_ioctl+0x2c/0x80
[<c0170322>] vfs_ioctl+0x52/0x2f0
[<c017062f>] sys_ioctl+0x6f/0x80
[<c0102f27>] syscall_call+0x7/0xb
[<b7eebd04>] 0xb7eebd04
BUG: spinlock lockup on CPU#0, ifconfig/962, cf7a3304
[<c0104024>] dump_stack+0x24/0x30
[<c01e391f>] _raw_spin_lock+0x10f/0x140
[<c02981ec>] _spin_lock_irqsave+0x1c/0x30
[<d084eb86>] pcnet32_interrupt+0x216/0x290 [pcnet32]
[<c013b95d>] handle_IRQ_event+0x3d/0x70
[<c013ba2c>] __do_IRQ+0x9c/0x120
[<c0105025>] do_IRQ+0x25/0x60
[<c010316a>] common_interrupt+0x1a/0x20
[<c011927a>] __do_softirq+0x3a/0xa0
[<c011930d>] do_softirq+0x2d/0x30
[<c0119557>] irq_exit+0x37/0x40
[<c010502a>] do_IRQ+0x2a/0x60
[<c010316a>] common_interrupt+0x1a/0x20
[<c02983c0>] _spin_unlock_irqrestore+0x10/0x40
[<d08517ea>] pcnet32_open+0x27a/0x390 [pcnet32]
[<c02343e9>] dev_open+0x39/0x80
[<c0232b5a>] dev_change_flags+0xfa/0x130
[<c0277b7f>] devinet_ioctl+0x4ff/0x6f0
[<c0227b24>] sock_ioctl+0xf4/0x1f0
[<c017027c>] do_ioctl+0x2c/0x80
[<c0170322>] vfs_ioctl+0x52/0x2f0
[<c017062f>] sys_ioctl+0x6f/0x80
[<c0102f27>] syscall_call+0x7/0xb
[<b7eebd04>] 0xb7eebd04

Obviously that wasn't so good.

--
Len Sorensen

2007-05-04 17:50:23

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 01:44:56PM -0400, Lennart Sorensen wrote:
> On Fri, May 04, 2007 at 11:40:09AM -0400, Lennart Sorensen wrote:
> > On Fri, May 04, 2007 at 05:34:38PM +0200, Frederik Deweerdt wrote:
> > > For the "what" part, see Documentation/lockdep-design.txt. You'll enable
> > > it by with SPINLOCK_DEBUG, indeed.
> >
> > Well I hope to see it hit the BUG again soon then to see what it has to
> > say.
>
> Well I didn't see anything for a while with SPINLOCK_DEBUG enabled
> (maybe I didn't wait long enough). So I tried changing it to
> spin_lock_irqsave, and that didn't go well. I got this as the result
> now:
>
> onfiguring network interfaces...eth1: link up, 100Mbps, full-duplex
> BUG: spinlock recursion on CPU#0, ifconfig/962
> lock: cf7a3304, .magic: dead4ead, .owner: ifconfig/962, .owner_cpu: 0
> [<c0104024>] dump_stack+0x24/0x30
> [<c01e3947>] _raw_spin_lock+0x137/0x140
> [<c02981ec>] _spin_lock_irqsave+0x1c/0x30
> [<d084eb86>] pcnet32_interrupt+0x216/0x290 [pcnet32]
> [<c013b95d>] handle_IRQ_event+0x3d/0x70
> [<c013ba2c>] __do_IRQ+0x9c/0x120
> [<c0105025>] do_IRQ+0x25/0x60
> [<c010316a>] common_interrupt+0x1a/0x20
> [<c011927a>] __do_softirq+0x3a/0xa0
> [<c011930d>] do_softirq+0x2d/0x30
> [<c0119557>] irq_exit+0x37/0x40
> [<c010502a>] do_IRQ+0x2a/0x60
> [<c010316a>] common_interrupt+0x1a/0x20
> [<c02983c0>] _spin_unlock_irqrestore+0x10/0x40
> [<d08517ea>] pcnet32_open+0x27a/0x390 [pcnet32]
> [<c02343e9>] dev_open+0x39/0x80
> [<c0232b5a>] dev_change_flags+0xfa/0x130
> [<c0277b7f>] devinet_ioctl+0x4ff/0x6f0
> [<c0227b24>] sock_ioctl+0xf4/0x1f0
> [<c017027c>] do_ioctl+0x2c/0x80
> [<c0170322>] vfs_ioctl+0x52/0x2f0
> [<c017062f>] sys_ioctl+0x6f/0x80
> [<c0102f27>] syscall_call+0x7/0xb
> [<b7eebd04>] 0xb7eebd04
> BUG: spinlock lockup on CPU#0, ifconfig/962, cf7a3304
> [<c0104024>] dump_stack+0x24/0x30
> [<c01e391f>] _raw_spin_lock+0x10f/0x140
> [<c02981ec>] _spin_lock_irqsave+0x1c/0x30
> [<d084eb86>] pcnet32_interrupt+0x216/0x290 [pcnet32]
> [<c013b95d>] handle_IRQ_event+0x3d/0x70
> [<c013ba2c>] __do_IRQ+0x9c/0x120
> [<c0105025>] do_IRQ+0x25/0x60
> [<c010316a>] common_interrupt+0x1a/0x20
> [<c011927a>] __do_softirq+0x3a/0xa0
> [<c011930d>] do_softirq+0x2d/0x30
> [<c0119557>] irq_exit+0x37/0x40
> [<c010502a>] do_IRQ+0x2a/0x60
> [<c010316a>] common_interrupt+0x1a/0x20
> [<c02983c0>] _spin_unlock_irqrestore+0x10/0x40
> [<d08517ea>] pcnet32_open+0x27a/0x390 [pcnet32]
> [<c02343e9>] dev_open+0x39/0x80
> [<c0232b5a>] dev_change_flags+0xfa/0x130
> [<c0277b7f>] devinet_ioctl+0x4ff/0x6f0
> [<c0227b24>] sock_ioctl+0xf4/0x1f0
> [<c017027c>] do_ioctl+0x2c/0x80
> [<c0170322>] vfs_ioctl+0x52/0x2f0
> [<c017062f>] sys_ioctl+0x6f/0x80
> [<c0102f27>] syscall_call+0x7/0xb
> [<b7eebd04>] 0xb7eebd04
>
> Obviously that wasn't so good.

Nevermind. I am obviously an idiot today placing spin_lock_irqsave both
in place of spin_lock and spin_unlock. Yeah that will work well. Now
to try with spin_lock_irqrestore or whatever it is called.

--
Len Sorensen

2007-05-04 18:55:26

by Don Fry

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

All instances of obtaining the lock in pcnet32 are done as
spin_lock_irqsave except the interrupt handler itself. The interrupt mask
needs to be saved everywhere else, but the interrupt handler is known not
to need to save the flags.

If the lock is held and the same CPU tries to get the lock again, it will
wait a very long time ;-(. I believe the locking is fine for a
non-preemptable kernel, but I have little experience with a preemptable
kernel.

When does a preemptable kernel allow interrupts to occur?
Is there a bug in this particular architectures locking code?
>From looking at preempt-locking.txt the driver has (1) no per-cpu data,
(2) 'CPU state protection' should be fine,
(3) the 'lock is acquired and released by the same task'.
I don't see a problem unless I am misunderstanding something.

Don

2007-05-04 19:02:38

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 11:24:33AM -0700, Don Fry wrote:
> All instances of obtaining the lock in pcnet32 are done as
> spin_lock_irqsave except the interrupt handler itself. The interrupt mask
> needs to be saved everywhere else, but the interrupt handler is known not
> to need to save the flags.
>
> If the lock is held and the same CPU tries to get the lock again, it will
> wait a very long time ;-(. I believe the locking is fine for a
> non-preemptable kernel, but I have little experience with a preemptable
> kernel.
>
> When does a preemptable kernel allow interrupts to occur?

I have no idea actually.

> Is there a bug in this particular architectures locking code?

On i386? I hope not.

> From looking at preempt-locking.txt the driver has (1) no per-cpu data,
> (2) 'CPU state protection' should be fine,
> (3) the 'lock is acquired and released by the same task'.
> I don't see a problem unless I am misunderstanding something.

Well I don't know, but something is going wrong and causing the soft
lock up. I must admit I am surprised if an interrupt can occour while
handling an interrupt, but then again maybe that is supposed to be
allowed.

--
Len Sorensen

2007-05-07 14:08:14

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Fri, May 04, 2007 at 03:02:36PM -0400, Lennart Sorensen wrote:
> Well I don't know, but something is going wrong and causing the soft
> lock up. I must admit I am surprised if an interrupt can occour while
> handling an interrupt, but then again maybe that is supposed to be
> allowed.

I tried building a kernel where the only change was enabling the spin
lock debugging. It doesn't fail, while without spin lock debugging it
seemed to fail very frequently. Darn! I hate when debugging makes hides
the problem.

--
Len Sorensen

2007-05-07 14:50:02

by Frederik Deweerdt

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Mon, May 07, 2007 at 10:08:07AM -0400, Lennart Sorensen wrote:
> On Fri, May 04, 2007 at 03:02:36PM -0400, Lennart Sorensen wrote:
> > Well I don't know, but something is going wrong and causing the soft
> > lock up. I must admit I am surprised if an interrupt can occour while
> > handling an interrupt, but then again maybe that is supposed to be
> > allowed.
>
> I tried building a kernel where the only change was enabling the spin
> lock debugging. It doesn't fail, while without spin lock debugging it
> seemed to fail very frequently. Darn! I hate when debugging makes hides
> the problem.
Can you try running on another Geode LX system, just to rule out a
hardware problem on you board?

Regards,
Frederik

2007-05-07 17:45:18

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Mon, May 07, 2007 at 04:48:37PM +0200, Frederik Deweerdt wrote:
> Can you try running on another Geode LX system, just to rule out a
> hardware problem on you board?

Hmm, I thought I saw it on two systems already, but I should go try that
again.

--
Len Sorensen

2007-06-04 15:06:38

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange soft lockup detected message (looks like spin_lock bug in pcnet32)

On Mon, May 07, 2007 at 01:45:11PM -0400, Lennart Sorensen wrote:
> Hmm, I thought I saw it on two systems already, but I should go try that
> again.

Hmm, still haven't figured this out. I just saw this one this morning:

BUG: soft lockup detected on CPU#0!
[<c0103fc4>] dump_stack+0x24/0x30
[<c013d36e>] softlockup_tick+0x7e/0xc0
[<c011eb23>] update_process_times+0x33/0x80
[<c01062c9>] timer_interrupt+0x39/0x80
[<c013d6fd>] handle_IRQ_event+0x3d/0x70
[<c013da59>] __do_IRQ+0xa9/0x150
[<c0104e55>] do_IRQ+0x25/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<c013d6d8>] handle_IRQ_event+0x18/0x70
[<c013da59>] __do_IRQ+0xa9/0x150
[<c0104e55>] do_IRQ+0x25/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<c0119cda>] __do_softirq+0x3a/0xa0
[<c0119d6d>] do_softirq+0x2d/0x30
[<c0119fb7>] irq_exit+0x37/0x40
[<c0104e5a>] do_IRQ+0x2a/0x60
[<c010313a>] common_interrupt+0x1a/0x20
[<c013dcee>] setup_irq+0xce/0x1e0
[<c013de97>] request_irq+0x97/0xb0
[<d0851f9d>] pcnet32_open+0x4d/0x3d0 [pcnet32]
[<c023a4f9>] dev_open+0x39/0x80
[<c0238cea>] dev_change_flags+0xfa/0x130
[<c027eb9f>] devinet_ioctl+0x4ff/0x6f0
[<c022dab1>] sock_ioctl+0xf1/0x1f0
[<c017413c>] do_ioctl+0x2c/0x80
[<c01741e2>] vfs_ioctl+0x52/0x2f0
[<c01744ef>] sys_ioctl+0x6f/0x80
[<c0102ef7>] syscall_call+0x7/0xb
[<b7f41d04>] 0xb7f41d04

And it is happening on multiple systems. I am starting to wonder if it
is a bug in the soft lockup detection. Maybe it really isn't locked up
but just momentarily appears to be. I will try turning off the soft
lockup detection and see what happens.

--
Len Sorensen