2006-02-20 19:43:43

by Jeff Mahoney

Subject: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

This patch moves the netif_carrier_off() call from tg3_init_one()->
tg3_init_link_config() to tg3_open() as is the convention for most
other network drivers.

I was getting a panic after a tg3 device failed to initialize due to DMA
failure. The oops pointed to the link watch queue with spinlock debugging
enabled. Without spinlock debugging, the Oops didn't occur.

I suspect that the link event was getting queued but not executed until
after the DMA test had failed and the device was freed. The link event
was then operating on freed memory, which could contain anything. With this
patch applied, the Oops no longer occurs.

I believe this to be correct, but since I'm not a network guru, perhaps
someone who is could comment?

Signed-off-by: Jeff Mahoney <[email protected]>

diff -ruNpX dontdiff linux-2.6.16-rc4/drivers/net/tg3.c linux-2.6.16-rc4.tg3/drivers/net/tg3.c
--- linux-2.6.16-rc4/drivers/net/tg3.c 2006-02-20 13:52:40.000000000 -0500
+++ linux-2.6.16-rc4.tg3/drivers/net/tg3.c 2006-02-20 13:54:20.000000000 -0500
@@ -6443,6 +6443,8 @@ static int tg3_open(struct net_device *d
 	struct tg3 *tp = netdev_priv(dev);
 	int err;

+	netif_carrier_off(tp->dev);
+
 	tg3_full_lock(tp, 0);

 	tg3_disable_ints(tp);
@@ -10430,7 +10432,6 @@ static void __devinit tg3_init_link_conf
 	tp->link_config.speed = SPEED_INVALID;
 	tp->link_config.duplex = DUPLEX_INVALID;
 	tp->link_config.autoneg = AUTONEG_ENABLE;
-	netif_carrier_off(tp->dev);
 	tp->link_config.active_speed = SPEED_INVALID;
 	tp->link_config.active_duplex = DUPLEX_INVALID;
 	tp->link_config.phy_is_low_power = 0;
--
Jeff Mahoney
SuSE Labs
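
The suspected race can be modelled in user space. The following is only a sketch of the reasoning in the patch description, not kernel code: fake_dev, fake_carrier_off(), fake_free(), run_link_events() and probe_stale_events() are hypothetical names, and the "freed" state is a flag rather than a real free so that the example stays well defined. It assumes that a carrier change queues a deferred event holding a pointer to the device, and that the event runs only after the probe path has returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
struct fake_dev {
	bool alive;       /* false once the device has been "freed" */
	bool carrier_ok;  /* modelled carrier state */
};

#define MAX_EVENTS 8
static struct fake_dev *pending[MAX_EVENTS]; /* modelled link watch queue */
static size_t n_pending;

/* Change carrier state and queue a deferred event that refers to dev. */
static void fake_carrier_off(struct fake_dev *dev)
{
	dev->carrier_ok = false;
	assert(n_pending < MAX_EVENTS);
	pending[n_pending++] = dev;
}

/* Mark the device released; a real driver would free its memory here. */
static void fake_free(struct fake_dev *dev)
{
	dev->alive = false;
}

/* Run all queued events; count those that refer to a released device. */
static int run_link_events(void)
{
	int stale = 0;

	for (size_t i = 0; i < n_pending; i++)
		if (!pending[i]->alive)
			stale++;
	n_pending = 0;
	return stale;
}

/*
 * Model one probe attempt. carrier_off_in_probe selects the old placement
 * (during init) or the patched one (at open time, reached only when the
 * probe succeeded). Returns the number of events that ran against a
 * released device.
 */
int probe_stale_events(bool carrier_off_in_probe, bool dma_ok)
{
	static struct fake_dev dev;

	dev = (struct fake_dev){ .alive = true, .carrier_ok = true };
	n_pending = 0;

	if (carrier_off_in_probe)
		fake_carrier_off(&dev);   /* old placement: before the DMA test */

	if (!dma_ok) {
		fake_free(&dev);          /* probe error path releases the device */
		return run_link_events(); /* deferred work runs afterwards */
	}

	if (!carrier_off_in_probe)
		fake_carrier_off(&dev);   /* patched placement: at open time */

	return run_link_events();
}
```

In this model, probe_stale_events(true, false) reports one stale event, while the patched ordering reports none for the same DMA failure, because nothing is queued before the last step that can fail.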


2006-02-21 18:25:14

by Michael Chan

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

On Mon, 2006-02-20 at 14:43 -0500, Jeff Mahoney wrote:
> This patch moves the netif_carrier_off() call from tg3_init_one()->
> tg3_init_link_config() to tg3_open() as is the convention for most
> other network drivers.

I think moving netif_carrier_off() later is the right thing to do. We
can also move it to the end of tg3_init_one() just before returning 0.

>
> I was getting a panic after a tg3 device failed to initialize due to DMA
> failure. The oops pointed to the link watch queue with spinlock debugging
> enabled. Without spinlock debugging, the Oops didn't occur.
>
> I suspect that the link event was getting queued but not executed until
> after the DMA test had failed and the device was freed. The link event
> was then operating on freed memory, which could contain anything. With this
> patch applied, the Oops no longer occurs.

DMA test failed? What NIC device do you have? How did it fail?

Thanks.
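
The alternative placement suggested above (at the end of the probe routine, just before returning 0) can be sketched as follows. This is a hypothetical user-space model, not the tg3 code: struct probe_result and model_probe() are made-up names, and the -ENODEV value is only an assumption chosen to resemble a failed hardware test. The point it illustrates is that an action which schedules deferred work is placed after the last step that can fail, so the error path never releases a device that still has work queued against it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model; none of these names are tg3 or kernel APIs. */
struct probe_result {
	int  err;          /* 0 on success, negative errno on failure */
	bool event_queued; /* was a deferred link event scheduled? */
	bool dev_released; /* did the error path release the device? */
};

struct probe_result model_probe(bool dma_test_ok)
{
	struct probe_result r = { .err = 0 };

	/* Step 1: set link defaults. Deliberately no deferred work here. */

	/* Step 2: a hardware test that may fail. */
	if (!dma_test_ok) {
		r.err = -ENODEV;
		goto err_release;
	}

	/* Step 3: every failable step has passed, so queueing is safe. */
	r.event_queued = true;
	return r;

err_release:
	/* Nothing was queued, so releasing the device leaves no stale work. */
	r.dev_released = true;
	return r;
}
```

With this ordering, no path through model_probe() both queues an event and releases the device.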


2006-02-21 21:40:04

by David Miller

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

From: "Michael Chan" <[email protected]>
Date: Tue, 21 Feb 2006 08:44:20 -0800

> On Mon, 2006-02-20 at 14:43 -0500, Jeff Mahoney wrote:
> > This patch moves the netif_carrier_off() call from tg3_init_one()->
> > tg3_init_link_config() to tg3_open() as is the convention for most
> > other network drivers.
>
> I think moving netif_carrier_off() later is the right thing to do. We
> can also move it to the end of tg3_init_one() just before returning 0.

Agreed.

> > I was getting a panic after a tg3 device failed to initialize due to DMA
> > failure. The oops pointed to the link watch queue with spinlock debugging
> > enabled. Without spinlock debugging, the Oops didn't occur.
> >
> > I suspect that the link event was getting queued but not executed until
> > after the DMA test had failed and the device was freed. The link event
> > was then operating on freed memory, which could contain anything. With this
> > patch applied, the Oops no longer occurs.
>
> DMA test failed? What NIC device do you have? How did it fail?

I get this too with an old 5700 3COM card on sparc64. I'll get
you some more detailed info later today, hopefully.

Jeff, please get some details for Michael about your failure
case. Thanks.

2006-02-21 22:40:59

by Jeff Mahoney

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David S. Miller wrote:
> From: "Michael Chan" <[email protected]>
> Date: Tue, 21 Feb 2006 08:44:20 -0800
>
>> On Mon, 2006-02-20 at 14:43 -0500, Jeff Mahoney wrote:
>>> This patch moves the netif_carrier_off() call from tg3_init_one()->
>>> tg3_init_link_config() to tg3_open() as is the convention for most
>>> other network drivers.
>> I think moving netif_carrier_off() later is the right thing to do. We
>> can also move it to the end of tg3_init_one() just before returning 0.
>
> Agreed.
>
>>> I was getting a panic after a tg3 device failed to initialize due to DMA
>>> failure. The oops pointed to the link watch queue with spinlock debugging
>>> enabled. Without spinlock debugging, the Oops didn't occur.
>>>
>>> I suspect that the link event was getting queued but not executed until
>>> after the DMA test had failed and the device was freed. The link event
>>> was then operating on freed memory, which could contain anything. With this
>>> patch applied, the Oops no longer occurs.
>> DMA test failed? What NIC device do you have? How did it fail?
>
> I get this too with an old 5700 3COM card on sparc64. I'll get
> you some more detailed info later today, hopefully.
>
> Jeff, please get some details for Michael about your failure
> case. Thanks.

dmesg after modprobe tg3:
tg3.c:v3.49 (Feb 2, 2006)
ACPI: PCI Interrupt 0000:0a:02.0[A] -> GSI 24 (level, low) -> IRQ 201
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
tg3_test_dma() Write the buffer failed -19
tg3: DMA engine test failed, aborting.

relevant lspci output:
0000:0a:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701
Gigabit Ethernet (rev 15)
Subsystem: Compaq Computer Corporation NC7770 Gigabit Server
Adapter (PCI-X, 10/100/1000-T)
Flags: 66Mhz, medium devsel, IRQ 201
Memory at f7df0000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at dc080000 [disabled] [size=64K]
Capabilities: [40] PCI-X non-bridge device.
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: 64bit+
Queue=0/3 Enable-

If you need more details, I can try to dig them up. This is a machine
I've only been using for a few days for some testing and I'm not yet
familiar with all the hardware details.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD+5cXLPWxlyuTD7IRArVcAJ4nEwx2b1j50dJp1uLBbqKJp9UlGwCfbE9N
P9iCdmB7IvGXDsuxTyjRj5M=
=5QzB
-----END PGP SIGNATURE-----

2006-02-21 23:38:49

by Michael Chan

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

On Tue, 2006-02-21 at 17:41 -0500, Jeff Mahoney wrote:

>
> dmesg after modprobe tg3:
> tg3.c:v3.49 (Feb 2, 2006)
> ACPI: PCI Interrupt 0000:0a:02.0[A] -> GSI 24 (level, low) -> IRQ 201
> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> Dazed and confused, but trying to continue
> Do you have a strange power saving mode enabled?
> tg3_test_dma() Write the buffer failed -19
> tg3: DMA engine test failed, aborting.
>

You're getting an NMI during tg3_init_one() which means that the NIC is
probably bad. I did a quick test on the same version of the 5701 NIC
with the same tg3 driver and it worked fine.

Please find out if the NIC is known to be bad. Thanks.



2006-02-22 00:35:34

by David Miller

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

From: "Michael Chan" <[email protected]>
Date: Tue, 21 Feb 2006 13:57:28 -0800

> On Tue, 2006-02-21 at 17:41 -0500, Jeff Mahoney wrote:
>
> > dmesg after modprobe tg3:
> > tg3.c:v3.49 (Feb 2, 2006)
> > ACPI: PCI Interrupt 0000:0a:02.0[A] -> GSI 24 (level, low) -> IRQ 201
> > Uhhuh. NMI received for unknown reason 21 on CPU 0.
> > Dazed and confused, but trying to continue
> > Do you have a strange power saving mode enabled?
> > tg3_test_dma() Write the buffer failed -19
> > tg3: DMA engine test failed, aborting.
>
> You're getting an NMI during tg3_init_one() which means that the NIC is
> probably bad. I did a quick test on the same version of the 5701 NIC
> with the same tg3 driver and it worked fine.
>
> Please find out if the NIC is known to be bad. Thanks.

I wonder if this is how this platform informs the cpu of master-abort
or target-abort cycles? It could maybe also be an IRQ routing
problem...

2006-02-22 16:48:19

by Jeff Mahoney

Subject: Re: [PATCH] tg3: netif_carrier_off runs too early; could still be queued when init fails

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Chan wrote:
> On Tue, 2006-02-21 at 17:41 -0500, Jeff Mahoney wrote:
>
>> dmesg after modprobe tg3:
>> tg3.c:v3.49 (Feb 2, 2006)
>> ACPI: PCI Interrupt 0000:0a:02.0[A] -> GSI 24 (level, low) -> IRQ 201
>> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> Dazed and confused, but trying to continue
>> Do you have a strange power saving mode enabled?
>> tg3_test_dma() Write the buffer failed -19
>> tg3: DMA engine test failed, aborting.
>>
>
> You're getting an NMI during tg3_init_one() which means that the NIC is
> probably bad. I did a quick test on the same version of the 5701 NIC
> with the same tg3 driver and it worked fine.
>
> Please find out if the NIC is known to be bad. Thanks.

Up until recently, this NIC was reported to work. I booted our
2.6.5-based SLES9 kernel on it. This is the kernel the machine has been
running for a while with the NIC working, and when I booted it, I got
the same DMA failure messages as with 2.6.16-rc4.

I suspect that the hardware has just recently failed, and I figured it
was a hardware problem when I saw the NMI/DMA messages, but since I
don't have physical access to the hardware, immediate removal wasn't an
option.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD/JX2LPWxlyuTD7IRAiKOAKCmFcjKzmyJEVF63hsm5zxPFVwNBACdHTR7
CghdO/WCfh4mwCaH1uwh1fc=
=Z5Ey
-----END PGP SIGNATURE-----