2006-02-16 19:39:56

by Stephan von Krawczynski

[permalink] [raw]
Subject: Severe problem with e1000 driver in 2.4.31/32 (at least)

Hello,

today we had to experience a bug in e1000 network driver on this type of
network card (a current PCI e1000 sold everywhere):

02:07.0 Ethernet controller: Intel Corp.: Unknown device 107c (rev 05)
Subsystem: Intel Corp.: Unknown device 1376
Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 18
Memory at f9000000 (32-bit, non-prefetchable) [size=128K]
Memory at f9020000 (32-bit, non-prefetchable) [size=128K]
I/O ports at a800 [size=64]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Capabilities: [e4] PCI-X non-bridge device.

We can simply crash the box (2.4.32 stock kernel, 2.4.31 is the same) by
performing this:

1) Boot box with network physically disconnected
2) start pinging some host somewhere, you get "unreachable", let it run
3) connect the network and await some ping replies.
4) disconnect the network again
5) box is dead

The box runs into a BUG in e1000_hw.c line 5052. The BUG shows up because the
code is obviously executed inside an interrupt, which seems not intended.
As this BUG is always reproducable and pretty annoying we made this pretty bad
workaround:

In e1000_osdep.h replace:

#define msec_delay(x) do { if(in_interrupt()) { \
/* Don't mdelay in interrupt context! */ \
BUG(); \
} else { \
set_current_state(TASK_UNINTERRUPTIBLE); \
schedule_timeout((x * HZ)/1000 + 2); \
} } while(0)

with

#define msec_delay(x) do { if(in_interrupt()) { \
/* Don't mdelay in interrupt context! */ \
mdelay(x); \
} else { \
set_current_state(TASK_UNINTERRUPTIBLE); \
schedule_timeout((x * HZ)/1000 + 2); \
} } while(0)


Obviously this is not a solution, but it saves the world at least. This is why
I offer no applyable patch.

Someone with inside knowledge of this driver should take a look why
e1000_config_dsp_after_link_change in e1000_hw.c is executed in interrupt
context. Btw I don't know if this shows up in 2.6 either.

--
Regards,
Stephan


2006-02-16 21:49:19

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: Severe problem with e1000 driver in 2.4.31/32 (at least)

> ---------- Forwarded message ----------
> From: Stephan von Krawczynski <[email protected]>
> today we had to experience a bug in e1000 network driver on this type of
> network card (a current PCI e1000 sold everywhere):
>
> 02:07.0 Ethernet controller: Intel Corp.: Unknown device 107c (rev 05)
> Subsystem: Intel Corp.: Unknown device 1376
> Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 18
> Memory at f9000000 (32-bit, non-prefetchable) [size=128K]
> Memory at f9020000 (32-bit, non-prefetchable) [size=128K]
> I/O ports at a800 [size=64]
> Expansion ROM at <unassigned> [disabled] [size=128K]
> Capabilities: [dc] Power Management version 2
> Capabilities: [e4] PCI-X non-bridge device.
>
> We can simply crash the box (2.4.32 stock kernel, 2.4.31 is the same) by
> performing this:
>
> 1) Boot box with network physically disconnected
> 2) start pinging some host somewhere, you get "unreachable", let it run
> 3) connect the network and await some ping replies.
> 4) disconnect the network again
> 5) box is dead
>
> The box runs into a BUG in e1000_hw.c line 5052. The BUG shows up because the
> code is obviously executed inside an interrupt, which seems not intended.
> As this BUG is always reproducable and pretty annoying we made this pretty bad
> workaround:

Please try this patch, compile tested. It matches up this particular code
to what is currently in 2.6.16-rc

e1000: fix BUG reported due to calling msec_delay in irq context

There are some functions that are called in irq context that need to use
msec_delay_irq instead to avoid a BUG.

Signed-off-by: Jesse Brandeburg <[email protected]>

---

drivers/net/e1000/e1000_hw.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -5049,7 +5049,7 @@ e1000_config_dsp_after_link_change(struc
if(ret_val)
return ret_val;

- msec_delay(20);
+ msec_delay_irq(20);

ret_val = e1000_write_phy_reg(hw, 0x0000,
IGP01E1000_IEEE_FORCE_GIGA);
@@ -5073,7 +5073,7 @@ e1000_config_dsp_after_link_change(struc
if(ret_val)
return ret_val;

- msec_delay(20);
+ msec_delay_irq(20);

/* Now enable the transmitter */
ret_val = e1000_write_phy_reg(hw, 0x2F5B, phy_saved_data);
@@ -5098,7 +5098,7 @@ e1000_config_dsp_after_link_change(struc
if(ret_val)
return ret_val;

- msec_delay(20);
+ msec_delay_irq(20);

ret_val = e1000_write_phy_reg(hw, 0x0000,
IGP01E1000_IEEE_FORCE_GIGA);
@@ -5114,7 +5114,7 @@ e1000_config_dsp_after_link_change(struc
if(ret_val)
return ret_val;

- msec_delay(20);
+ msec_delay_irq(20);

/* Now enable the transmitter */
ret_val = e1000_write_phy_reg(hw, 0x2F5B, phy_saved_data);

2006-02-17 13:04:25

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: Severe problem with e1000 driver in 2.4.31/32 (at least)

On Thu, 16 Feb 2006 13:49:12 -0800 (PST)
Jesse Brandeburg <[email protected]> wrote:


> > The box runs into a BUG in e1000_hw.c line 5052. The BUG shows up because
> > the code is obviously executed inside an interrupt, which seems not
> > intended. As this BUG is always reproducable and pretty annoying we made
> > this pretty bad workaround:
>
> Please try this patch, compile tested. It matches up this particular code
> to what is currently in 2.6.16-rc
>
> e1000: fix BUG reported due to calling msec_delay in irq context
>
> There are some functions that are called in irq context that need to use
> msec_delay_irq instead to avoid a BUG.
>
> Signed-off-by: Jesse Brandeburg <[email protected]>

Hello Jesse,

I can confirm the patch eliminates the problem here. I first thought about
doing it the same way but was unsure whether the function itself should execute
in interrupt at all.
Thank you for your immediate answer, please make sure to include the patch as
is in 2.4.

Stephan



>
> ---
>
> drivers/net/e1000/e1000_hw.c | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
> --- a/drivers/net/e1000/e1000_hw.c
> +++ b/drivers/net/e1000/e1000_hw.c
> @@ -5049,7 +5049,7 @@ e1000_config_dsp_after_link_change(struc
> if(ret_val)
> return ret_val;
>
> - msec_delay(20);
> + msec_delay_irq(20);
>
> ret_val = e1000_write_phy_reg(hw, 0x0000,
> IGP01E1000_IEEE_FORCE_GIGA);
> @@ -5073,7 +5073,7 @@ e1000_config_dsp_after_link_change(struc
> if(ret_val)
> return ret_val;
>
> - msec_delay(20);
> + msec_delay_irq(20);
>
> /* Now enable the transmitter */
> ret_val = e1000_write_phy_reg(hw, 0x2F5B, phy_saved_data);
> @@ -5098,7 +5098,7 @@ e1000_config_dsp_after_link_change(struc
> if(ret_val)
> return ret_val;
>
> - msec_delay(20);
> + msec_delay_irq(20);
>
> ret_val = e1000_write_phy_reg(hw, 0x0000,
> IGP01E1000_IEEE_FORCE_GIGA);
> @@ -5114,7 +5114,7 @@ e1000_config_dsp_after_link_change(struc
> if(ret_val)
> return ret_val;
>
> - msec_delay(20);
> + msec_delay_irq(20);
>
> /* Now enable the transmitter */
> ret_val = e1000_write_phy_reg(hw, 0x2F5B, phy_saved_data);
>