2002-10-06 03:33:26

by Ben Greear

Subject: Update on e1000 troubles (over-heating!)

I believe I have figured out why the e1000 crashed my machine
after .5 - 1 hours: The NIC was over-heating. I measured one of
the NICs after the machine crashed with an external (cheap) temp
probe. It registered right at 50 degrees C, and this was about 15-30
seconds after it crashed.

The dual e1000 NIC I have seems to run much cooler, and has been
running at 430Mbps bi-directional on both ports for about 6 hours now
with no obvious problems.

So, I'm going to try to purchase some heat sinks and glue them onto
the e1000 server NICs, to see if that fixes the problem.

Hope this proves useful to anyone experiencing similar strange
crashes!

Thanks,
Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear



2002-10-06 03:44:18

by Andre Hedrick

Subject: Re: Update on e1000 troubles (over-heating!)


I have a pair of Compaq e1000s which have never overheated, and I use
them for heavy-duty iSCSI testing and driver design. These are
massive 66 MHz/64-bit cards, but I still see nothing like what you are reporting.

I will look some more at the issue soon.

Cheers,

Andre Hedrick
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/


On Sat, 5 Oct 2002, Ben Greear wrote:

> I believe I have figured out why the e1000 crashed my machine
> after .5 - 1 hours: The NIC was over-heating. I measured one of
> the NICs after the machine crashed with an external (cheap) temp
> probe. It registered right at 50 degrees C, and this was about 15-30
> seconds after it crashed.
>
> The dual e1000 NIC I have seems to run much cooler, and has been
> running at 430Mbps bi-directional on both ports for about 6 hours now
> with no obvious problems.
>
> So, I'm going to try to purchase some heat sinks and glue them onto
> the e1000 server nics, to see if that fixes the problem.
>
> Hope this proves useful to anyone experiencing similar strange
> crashes!
>
> Thanks,
> Ben
>
> --
> Ben Greear <[email protected]> <Ben_Greear AT excite.com>
> President of Candela Technologies Inc http://www.candelatech.com
> ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

2002-10-06 12:16:10

by Feldman, Scott

Subject: RE: Update on e1000 troubles (over-heating!)

> I believe I have figured out why the e1000 crashed my machine
> after .5 - 1 hours: The NIC was over-heating. I measured
> one of the NICs after the machine crashed with an external
> (cheap) temp probe. It registered right at 50 degrees C, and
> this was about 15-30 seconds after it crashed.

Ben, please send lspci -x output for the hot NIC.

-scott

2002-10-06 22:40:32

by jamal

Subject: Re: Update on e1000 troubles (over-heating!)



On Sat, 5 Oct 2002, Andre Hedrick wrote:

>
> I have a pair of Compaq e1000's which have never overheated, and I use
> them for heavy duty iSCSI testing and designing of drivers. These are
> massive 66/64 cards but still nothing like what you are reporting.
>
> I will look some more at the issue soon.
>

It seems like the prerequisite to reproduce it is you beat the NIC heavily
with a lot of packets/sec and then run it at that sustained rate for at
least 30 minutes. iSCSI would tend to use MTU-sized packets, which will
not be that effective at driving up the packet rate.

cheers,
jamal
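
To put rough numbers on the packets/sec point: for a fixed link speed, the
frame rate depends heavily on frame size. The sketch below is only an
illustration. It assumes a 1 Gbps link driven at full line rate and standard
Ethernet framing overhead (8 bytes of preamble plus a 12-byte inter-frame
gap); the function name is made up for this example, and none of the figures
were measured in this thread.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 8 bytes of preamble/SFD plus a 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 8 + 12


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second for a given link rate and frame size.

    frame_bytes is the full Ethernet frame including header and FCS
    (64 for a minimum-size frame, 1518 for a 1500-byte MTU).
    """
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    gigabit = 1_000_000_000  # assumed link speed, bits per second
    for size in (64, 1518):
        rate = packets_per_second(gigabit, size)
        print(f"{size:5d}-byte frames: {rate:12.0f} pps")
```

Under those assumptions, minimum-size frames give roughly 18 times the packet
rate of MTU-sized frames at the same bandwidth.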




2002-10-07 00:11:56

by Andre Hedrick

Subject: Re: Update on e1000 troubles (over-heating!)


However, doing a data integrity test with a pattern buffer
write-verify-read on multi-LUN, multi-session, and multiple connections
per session, while issuing load-balancing commands (i.e. thread tag) over
each session to roast the bandwidth of the line, should be enough.

Now toss in injected errors to randomly fail data PDUs, and call a
sync-and-steering layer to scan the header and/or data digests and execute
a within-connection recovery, regardless of the reason, and that should be
enough to warm up the beast.

If that is not enough, I can toss in multiple initiators, all with the
features above, or invoke the interoperability modes to add the Cisco and
IBM initiators (both limited to error recovery level zero, while PyX's is
capable of error recovery level one and part of two).

Please let me know if I need to throttle it harder.

Cheers,

On Sun, 6 Oct 2002, jamal wrote:

>
>
> On Sat, 5 Oct 2002, Andre Hedrick wrote:
>
> >
> > I have a pair of Compaq e1000's which have never overheated, and I use
> > them for heavy duty iSCSI testing and designing of drivers. These are
> > massive 66/64 cards but still nothing like what you are reporting.
> >
> > I will look some more at the issue soon.
> >
>
> It seems like the prerequisite to reproduce it is you beat the NIC heavily
> with a lot of packets/sec and then run it at that sustained rate for at
> least 30 minutes. iSCSI would tend to use MTU sized packets which will
> not be that effective.
>
> cheers,
> jamal
>
>
>
>

Andre Hedrick
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/

2002-10-07 03:41:21

by Ben Greear

Subject: Re: Update on e1000 troubles (over-heating!)

jamal wrote:

> It seems like the prerequisite to reproduce it is you beat the NIC heavily
> with a lot of packets/sec and then run it at that sustained rate for at
> least 30 minutes. iSCSI would tend to use MTU sized packets which will
> not be that effective.

I can reproduce my crash using MTU-sized packets running only 50Mbps send + receive
on 2 NICs. It took overnight to do it, though. Running as hard as I can with
MTU-sized packets will crash it as well, and much quicker.

Interestingly enough, the tg3 NIC (Netgear 302t) registered 57 deg C between
the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
is running too hot :P

Dave says I'm weird and no one else sees these bizarre problems, btw :)

More trouble-shooting to follow this next week.

Thanks,
Ben


--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2002-10-07 05:27:48

by David Miller

Subject: Re: Update on e1000 troubles (over-heating!)

From: Ben Greear <[email protected]>
Date: Sun, 06 Oct 2002 20:46:42 -0700

Dave says I'm weird and no one else sees these bizarre problems, btw :)

The only case where I'm really concerned about the health
of your PCI controller is the most recent case you've
reported to me where pci_find_capability(pdev, PCI_CAP_ID_PM)
fails. That is just completely bizarre.

I hope your boards aren't being permanently harmed by your box which
is overheating. :(
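
For readers wondering why that failure is so odd: in outline,
pci_find_capability() checks the capability-list bit in the PCI status
register and then follows the pointer stored at config-space offset 0x34.
The sketch below decodes those two fields by hand from the first 64 bytes of
"lspci -x" output. The bytes are copied from the dump for the e1000 at
00:08.0 that Ben posts later in this thread (a working device, not the
failing one), and the helper names are hypothetical. It only inspects the
standard header and does not walk the full capability chain, since the dump
stops at byte 0x3f.

```python
import struct

# First 64 bytes of config space for device 00:08.0, copied from the
# "lspci -x" output posted in this thread.
HEX_DUMP = """
00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00
10: 04 00 00 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 ff 00
"""

PCI_STATUS_CAP_LIST = 0x10  # status register bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34  # offset of the pointer to the first capability


def parse_dump(text: str) -> bytes:
    """Turn 'offset: xx xx ...' lines into raw config-space bytes."""
    data = bytearray()
    for line in text.strip().splitlines():
        _, _, hex_part = line.partition(":")
        data.extend(bytes.fromhex(hex_part.strip()))
    return bytes(data)


def summarize(cfg: bytes) -> dict:
    """Decode the header fields relevant to a capability lookup."""
    # Vendor, device, command and status are little-endian 16-bit fields.
    vendor, device, _command, status = struct.unpack_from("<HHHH", cfg, 0)
    has_caps = bool(status & PCI_STATUS_CAP_LIST)
    # The low two bits of the capability pointer are reserved.
    first = cfg[PCI_CAPABILITY_LIST] & 0xFC if has_caps else None
    return {
        "vendor": vendor,
        "device": device,
        "status": status,
        "has_capability_list": has_caps,
        "first_capability_offset": first,
    }


if __name__ == "__main__":
    info = summarize(parse_dump(HEX_DUMP))
    print(f"vendor {info['vendor']:04x} device {info['device']:04x}")
    print(f"capability list present: {info['has_capability_list']}")
    if info["first_capability_offset"] is not None:
        print(f"first capability at 0x{info['first_capability_offset']:02x}")
```

For this device the status bit is set and the pointer is 0xdc, which agrees
with the "Capabilities: [dc] Power Management" line in the -vv output.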

2002-10-07 11:55:15

by jamal

Subject: Re: Update on e1000 troubles (over-heating!)



On Sun, 6 Oct 2002, Ben Greear wrote:

> I can reproduce my crash using mtu sized pkts running only 50Mbps
> send + receive on 2 nics. It took over-night to do it though. Running
> as hard as I can with MTU packets will crash it as well, and much
> quicker.
>

So is there a correlation with packet count then?


> Interestingly enough, the tg3 NIC (netgear 302t), registered 57 deg C between
> the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
> is running too hot :P

Does the problem happen with the tg3?

cheers,
jamal


2002-10-07 11:58:38

by jamal

Subject: Re: Update on e1000 troubles (over-heating!)



It does seem like you need a lot of packets over a period of time
to recreate it. So if what you are trying to do can achieve that,
you should be able to reproduce it. How many connections and sessions can you
support? BTW, does iSCSI call for a zero-copy receive?

cheers,
jamal



2002-10-07 12:00:08

by David Miller

Subject: Re: Update on e1000 troubles (over-heating!)

From: jamal <[email protected]>
Date: Mon, 7 Oct 2002 07:53:26 -0400 (EDT)

Does the problem happen with the tg3?

He gets hangs in one box, inoperable PCI config space accesses for the
cards in another box.

2002-10-07 16:35:14

by Ben Greear

Subject: Re: Update on e1000 troubles (over-heating!)

jamal wrote:
>
> On Sun, 6 Oct 2002, Ben Greear wrote:
>
>
>>I can reproduce my crash using mtu sized pkts running only 50Mbps
>>send + receive on 2 nics. It took over-night to do it though. Running
>>as hard as I can with MTU packets will crash it as well, and much
>>quicker.
>>
>
>
> So is there a correlation with packet count then?

No, running at slower speeds (50Mbps), the packet count was well over
4 billion (i.e. it successfully wrapped 32 bits). At higher speeds, it
generally crashes before the 32-bit wrap. It also does not correlate
with bytes sent/received, or anything else that I could think of to look at.
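
A 32-bit counter that has wrapped can still be used for rate calculations as
long as differences are taken modulo 2^32. A minimal sketch, assuming the
counter is sampled often enough that it wraps at most once between samples;
counter_delta is a hypothetical helper for illustration, not something taken
from the driver:

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xFFFFFFFF


def counter_delta(previous: int, current: int) -> int:
    """Packets counted between two samples of a 32-bit counter.

    Correct as long as the counter wrapped at most once between samples.
    """
    return (current - previous) & COUNTER_MASK


if __name__ == "__main__":
    # No wrap between samples.
    print(counter_delta(5, 10))
    # Counter wrapped past 2**32 between samples.
    print(counter_delta(0xFFFFFFF0, 0x10))
```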

>
>
>
>>Interestingly enough, the tg3 NIC (netgear 302t), registered 57 deg C between
>>the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
>>is running too hot :P
>
>
> Does the problem happen with the tg3?

As Dave mentioned, tg3 locks up almost immediately (within about 30 seconds),
and in the meantime, it's spitting out errors that are 'impossible'. See the
messages I sent a day or two ago.

I may have cooked my cards, or something like that, because one of
the tg3s does not work in my other machine now. Still troubleshooting that one.

Ben


>
> cheers,
> jamal
>
>


--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2002-10-08 18:40:08

by Ben Greear

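Subject: Re: Update on e1000 troubles (over-heating!)

Here is the lspci information, both -x and -vv. This is with
two of the e1000 single-port NICs side-by-side. I have also
strapped a P-IV CPU fan on top of the two cards to blow some
air over them....running tests now to see if that actually
helps anything. If it does, I'll be sure to send you a picture :)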

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00: 22 10 0c 70 06 00 30 22 11 00 00 06 00 40 00 00
10: 08 00 00 f8 08 00 20 f6 91 10 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00: 22 10 0d 70 07 00 20 02 00 00 04 06 00 40 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 44 f1 01 20 22
20: f0 ff 00 00 f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 04 00

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00: 22 10 40 74 0f 00 20 02 05 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00: 22 10 41 74 05 00 00 02 04 8a 01 01 00 40 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 22 10 41 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00: 22 10 43 74 00 00 80 02 03 00 80 06 00 40 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 22 10 43 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:08.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00
10: 04 00 00 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 ff 00

00:09.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00
10: 04 00 02 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 41 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 09 01 ff 00

00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
00: 22 10 48 74 17 00 20 22 05 00 04 06 00 63 01 00
10: 00 00 00 00 00 00 00 00 00 02 02 a8 20 20 00 22
20: 10 f4 f0 f5 f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 0c 00

02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07)
00: 22 10 49 74 17 00 80 82 07 10 03 0c 00 40 00 00
10: 00 00 10 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 22 10 49 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 04 00 50

02:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00: 02 10 52 47 87 00 90 02 27 00 00 03 10 42 00 00
10: 00 00 00 f5 01 20 00 00 00 10 10 f4 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 02 10 08 80
30: 00 00 00 00 5c 00 00 00 00 00 00 00 ff 00 08 00

02:08.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
00: b7 10 05 98 17 00 10 02 78 00 00 02 10 50 00 00
10: 01 24 00 00 00 20 10 f4 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f1 10 62 24
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 0a 0a

02:09.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
00: b7 10 05 98 17 00 10 02 78 00 00 02 10 50 00 00
10: 81 24 00 00 00 24 10 f4 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f1 10 62 24
30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 64
Region 0: Memory at f8000000 (32-bit, prefetchable) [size=64M]
Region 1: Memory at f6200000 (32-bit, prefetchable) [size=4K]
Region 2: I/O ports at 1090 [disabled] [size=4]
Capabilities: [a0] AGP version 2.0
Status: RQ=15 SBA+ 64bit- FW- Rate=x1,x2
Command: RQ=0 SBA+ AGP+ 64bit- FW- Rate=<none>

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=68
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04) (prog-if 8a [Master SecP PriP])
Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Region 4: I/O ports at f000 [size=16]

00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-

00:08.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
Subsystem: Intel Corp.: Unknown device 1001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (63750ns min), cache line size 10
Interrupt: pin A routed to IRQ 10
Region 0: Memory at f4000000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at 1000 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [e4] PCI-X non-bridge device.
Command: DPERE- ERO+ RBC=0 OST=0
Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000

00:09.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
Subsystem: Intel Corp.: Unknown device 1001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (63750ns min), cache line size 10
Interrupt: pin A routed to IRQ 9
Region 0: Memory at f4020000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at 1040 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [e4] PCI-X non-bridge device.
Command: DPERE- ERO+ RBC=0 OST=0
Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000

00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 99
Bus: primary=00, secondary=02, subordinate=02, sec-latency=168
I/O behind bridge: 00002000-00002fff
Memory behind bridge: f4100000-f5ffffff
BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-

02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07) (prog-if 10 [OHCI])
Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] USB
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
Latency: 64 (20000ns max)
Interrupt: pin D routed to IRQ 10
Region 0: Memory at f4100000 (32-bit, non-prefetchable) [size=4K]

02:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA])
Subsystem: ATI Technologies Inc: Unknown device 8008
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 66 (2000ns min), cache line size 10
Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
Region 1: I/O ports at 2000 [size=256]
Region 2: Memory at f4101000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [5c] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

02:08.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
Subsystem: Tyan Computer: Unknown device 2462
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 80 (2500ns min, 2500ns max), cache line size 10
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at 2400 [size=128]
Region 1: Memory at f4102000 (32-bit, non-prefetchable) [size=128]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-

02:09.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
Subsystem: Tyan Computer: Unknown device 2462
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 80 (2500ns min, 2500ns max), cache line size 10
Interrupt: pin A routed to IRQ 5
Region 0: I/O ports at 2480 [size=128]
Region 1: Memory at f4102400 (32-bit, non-prefetchable) [size=128]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-


Attachments:
lspci.txt (10.53 kB)

2002-10-15 02:14:21

by Feldman, Scott

Subject: RE: Update on e1000 troubles (over-heating!)

> Here is the lspci information, both -x and -vv. This is with
> two of the e1000 single-port NICS side-by-side. I have also
> strapped a P-IV CPU fan on top of the two cards to blow some
> air over them....running tests now to see if that actually
> helps anything. If it does, I'll be sure to send you a picture :)

Ben, I checked the datasheet for the part shown in the lspci dump, and it
shows an operating temperature of 0-55 degrees C. You said you measured 50
degrees C, so you're within the safe range. Did the fans help?

Here's the datasheet:
http://www.intel.com/network/connectivity/resources/doc_library/data_sheets/pro1000mt_sa.pdf

-scott

2002-10-15 02:31:34

by Andi Kleen

Subject: Re: Update on e1000 troubles (over-heating!)

On Mon, Oct 14, 2002 at 07:20:04PM -0700, Feldman, Scott wrote:
> > Here is the lspci information, both -x and -vv. This is with
> > two of the e1000 single-port NICS side-by-side. I have also
> > strapped a P-IV CPU fan on top of the two cards to blow some
> > air over them....running tests now to see if that actually
> > helps anything. If it does, I'll be sure to send you a picture :)
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?

The thermometer he used likely showed a much lower temperature than what was
actually on the die. 5-10 C more is not unlikely. It's hard to measure chip
temperatures accurately without an on-die thermal diode or special kit.
So I would expect that when an external normal thermometer showed 50C
it was already operating out of spec.

-Andi
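
The arithmetic behind this estimate, as a small sketch. It assumes the 55 C
limit applies to the chip itself and uses the 5-10 C probe error suggested
above; both are assumptions taken from this message, not measurements, and
the function names are invented for the example.

```python
SPEC_MAX_C = 55                # upper operating limit quoted from the datasheet
PROBE_READING_C = 50           # external probe reading reported in this thread
PROBE_ERROR_RANGE_C = (5, 10)  # assumed gap between probe and die temperature


def estimated_die_range(reading_c, error_range_c):
    """Return (low, high) die temperature estimates for a probe reading."""
    low, high = error_range_c
    return reading_c + low, reading_c + high


def margin_to_limit(temp_c, limit_c=SPEC_MAX_C):
    """Positive means headroom; zero or negative means at or over the limit."""
    return limit_c - temp_c


if __name__ == "__main__":
    low, high = estimated_die_range(PROBE_READING_C, PROBE_ERROR_RANGE_C)
    print(f"estimated die temperature: {low} to {high} C")
    print(f"margin to limit: {margin_to_limit(low)} to {margin_to_limit(high)} C")
```

With these inputs the estimate lands at or above the quoted limit, with no
headroom left even at the low end of the assumed error.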

2002-10-15 02:48:32

by Jonathan Lundell

Subject: Re: Update on e1000 troubles (over-heating!)

At 4:37am +0200 10/15/02, Andi Kleen wrote:
>> Ben, I checked the datasheet for the part shown in the lspci dump, and it
>> shows an operating temperature of 0-55 degrees C. You said you measured 50
>> degrees C, so you're within the safe range. Did the fans help?
>
>The thermometer he used likely showed a much lower temperature than what was
>actually on the die. 5-10 C more are not unlikely. It's hard to measure chip
>temperatures accurately without an on die thermal diode or special kit.
>So I would expect that when an external normal thermometer showed 50C
>it was already operating out of spec.

The datasheet's for the card, so the operating temperature is surely
ambient, not die temperature. "Ambient measured how?" would be a
reasonable question, though.
--
/Jonathan Lundell.

2002-10-15 05:37:17

by Dave Hansen

Subject: Re: Update on e1000 troubles (over-heating!)

Feldman, Scott wrote:
>>Here is the lspci information, both -x and -vv. This is with
>>two of the e1000 single-port NICS side-by-side. I have also
>>strapped a P-IV CPU fan on top of the two cards to blow some
>>air over them....running tests now to see if that actually
>>helps anything. If it does, I'll be sure to send you a picture :)
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?
>
> Here's the datasheet:
> http://www.intel.com/network/connectivity/resources/doc_library/data_sheets/pro1000mt_sa.pdf

I get some strange e1000 failures too. It usually involves the
watchdog kicking them back into order, but sometimes they'll stay
offline for a while. Heat would explain it, though, because it only
happens when I'm actually using the cards for a benchmark. I figured
that it was either my cables, or a shoddy switch.

The new dual-port e1000 that I have doesn't seem to have this problem,
even though I'm running 4 times more traffic than the singles that I
had.
--
Dave Hansen
[email protected]

2002-10-15 06:56:43

by Ben Greear

Subject: Re: Update on e1000 troubles (over-heating!)

Feldman, Scott wrote:
>>Here is the lspci information, both -x and -vv. This is with
>>two of the e1000 single-port NICS side-by-side. I have also
>>strapped a P-IV CPU fan on top of the two cards to blow some
>>air over them....running tests now to see if that actually
>>helps anything. If it does, I'll be sure to send you a picture :)
>
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?

The fan did help, and Andi is right, the chip was much hotter than what
my probe read (I was gently pushing it against the top of the chip, because it
was too hot to really press my finger against it to get good contact :))

With the fan blowing on the chips, it has been perfect. This implies to me
that if you are going to run the e1000, you need significant airflow over
the chipset, and the generic 2U chassis that I have is definitely inadequate,
partially because the MB is so big that the fans are too far away from the
PCI slots... This is all doubly true if you are running two NICs side-by-side,
which is what I was doing.

I am also considering gluing heat sinks onto the main chip, which may make it
work in more marginal environments.

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2002-10-15 07:01:54

by Ben Greear

Subject: Re: Update on e1000 troubles (over-heating!)

Dave Hansen wrote:

>
> I get some strange e1000 failures too. It usually involves the watchdog
> kicking them back into order, but sometimes they'll stay offline for a
> while. Heat would explain it, though, because it only happens when I'm
> actually using the cards for a benchmark. I figured that it was either
> my cables, or a shoddy switch.
>
> The new dual-port e1000 that I have doesn't seem to have this problem,
> even though I'm running 4 times more traffic than the singles that I had.

That was exactly the behaviour I noticed. I believe it's because when you
run two side-by-side, they cook each other (I'm assuming you didn't run
2 2-ports side-by-side)

Try strapping a fan on them somehow and I bet all your troubles go
away (and maybe your .ibm email will shame Intel into putting heat-sinks
and/or small fans on their NICs... ;)

(I ran two Netgear 302t NICs (tigon-3) side-by-side for 4 days at max speed, and they
didn't drop a single packet, even though their heat-sinks were too hot to
touch!)

Ben

--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear