Hi,
I've got an ultra 1 with on-board HME that I'm trying to install linux
on via the serial console.
I'm network boot-strapping via the debian install described on
http://www.debian.org/releases/stable/sparc/install.en.html#contents
using files downloaded from
debian/dists/woody/main/disks-sparc/3.0.23-2002-05-21/sun4u
The result is
$ uname -a
Linux godot 2.4.18 #2 Thu Apr 11 14:37:17 EDT 2002 sparc64 unknown
and there are no modules currently installed. The network device is
attached to a 100 Mbit hub. All machines on this hub have their MTU set
to 576.
Because of screw-ups, I've downloaded significant amounts of Debian a
couple of times in the last 3 days over dial-up, and this has worked
flawlessly. However, a couple of times, I've seen a network freeze
(while the serial console has continued to work...) after local network
activity.
Once, this was an FTP transfer of a Gentoo stage 1 bootstrap image from
a box on the same subnet. On other occasions, shell commands with
significant output (like "find"), run via OpenSSH with priv-sep
concurrently with downloads, have also caused a freeze.
The resulting freeze leaves a box which doesn't respond to pings, and
cannot ping other hosts. "ifconfig eth0 down" then "up" does not resolve
the problem.
The following is the cut'n'paste of the last such freeze, which occurred
while I was running part of a Gentoo bootstrap on the serial console:
<snip earlier stuff>
>>> Downloading
http://www.ibiblio.org/pub/Linux/distributions/gentoo/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2
--19:34:34--
http://www.ibiblio.org/pub/Linux/distributions/gentoo/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2
=>
`/usr/portage/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2'
Resolving http://www.ibiblio.org... NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: Happy Status 03030000 TX[000003ff:00000101]
failed: Host not found.
>>> Downloading
http://www.ibiblio.org/gentoo/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2
--19:35:14--
http://www.ibiblio.org/gentoo/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2
=>
`/usr/portage/distfiles/binutils-manpages-2.11.92.0.12.3.tar.bz2'
Resolving http://www.ibiblio.org... eth0: Link is up using internal transceiver
at 100Mb/s, Half Duplex.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: Happy Status 03010000 TX[000003ff:00000101]
eth0: Link is up using internal transceiver at 100Mb/s, Half Duplex.
failed: Host not found.
!!! Couldn't download binutils-manpages-2.11.92.0.12.3.tar.bz2.
Aborting.
!!! emerge aborting on
/usr/portage/sys-devel/binutils/binutils-2.11.92.0.12.3-r2.ebuild .
godot:/usr/portage/scripts# NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: Happy Status 03010000 TX[000003ff:00000101]
eth0: Link is up using internal transceiver at 100Mb/s, Half Duplex.
<end quote>
Reboot seems to cure the problem (via break on the console and then
issuing a boot command at the prom monitor), until further stress is
applied.
Any ideas would be welcome.
Regards
Kieran Barry
* Kieran [020811 13:58]:
> I've got an ultra 1 with on-board HME that I'm trying to install linux
> on via the serial console.
...
> NETDEV WATCHDOG: eth0: transmit timed out
> eth0: transmit timed out, resetting
> eth0: Happy Status 03030000 TX[000003ff:00000101]
...
> Reboot seems to cure the problem (via break on the console and then
> issuing a boot command at the prom monitor), until further stress is
> applied.
Yep... known issue with the sunhme driver. AFAIK, it only affects the
HME onboard the U1E systems and no other HMEs. The quick band-aid
work-around is at the end of this email... seems to be some weirdo
timing issue. This patch has resolved the issue for several people with
U1E's. (E == enterprise... UPA (for a Creator3D), wide SCSI and HME
(instead of an le on the non-E models)).
--- drivers/net/sunhme.c.orig Mon Jul 15 02:38:27 2002
+++ drivers/net/sunhme.c Mon Jul 15 03:09:03 2002
@@ -1983,6 +1983,7 @@
}
hp->tx_old = elem;
TXD((">"));
+ udelay(1);
if (netif_queue_stopped(dev) &&
TX_BUFFS_AVAIL(hp) > (MAX_SKB_FRAGS + 1))
From: Joshua Uziel
Date: Sun, 11 Aug 2002 15:30:11 -0700
> Yep... known issue with the sunhme driver.
> ...
> --- drivers/net/sunhme.c.orig Mon Jul 15 02:38:27 2002
> +++ drivers/net/sunhme.c Mon Jul 15 03:09:03 2002
Hmmm, can the people who can reproduce this try this patch
instead?
--- drivers/net/sunhme.c.~1~ Sun Aug 11 18:37:34 2002
+++ drivers/net/sunhme.c Sun Aug 11 18:38:17 2002
@@ -1640,6 +1640,7 @@
HMD((", enable global interrupts, "));
hme_write32(hp, gregs + GREG_IMASK,
(GREG_IMASK_GOTFRAME | GREG_IMASK_RCNTEXP |
+ GREG_IMASK_TXALL |
GREG_IMASK_SENTFRAME | GREG_IMASK_TXPERR));
/* Set the transmit ring buffer size. */
@@ -2125,8 +2126,8 @@
happy_meal_mif_interrupt(hp);
}
- if (happy_status & GREG_STAT_TXALL) {
- HMD(("TXALL "));
+ if (happy_status & GREG_STAT_HOSTTOTX) {
+ HMD(("HOSTTOTX "));
happy_meal_tx(hp);
}
@@ -2155,7 +2156,7 @@
if (!(happy_status & (GREG_STAT_ERRORS |
GREG_STAT_MIFIRQ |
- GREG_STAT_TXALL |
+ GREG_STAT_HOSTTOTX |
GREG_STAT_RXTOHOST)))
continue;
@@ -2172,8 +2173,8 @@
happy_meal_mif_interrupt(hp);
}
- if (happy_status & GREG_STAT_TXALL) {
- HMD(("TXALL "));
+ if (happy_status & GREG_STAT_HOSTTOTX) {
+ HMD(("HOSTTOTX "));
happy_meal_tx(hp);
}
* David S. Miller [020811 18:44]:
> Hmmm, can the people who can reproduce this try this patch instead?
Yeah Dave, that seems to do it... at least with my testing on a U1/170E.
Previously, simply ping flooding it from another box would cause the
lockup pretty quickly.
I'll try to get some more people to test it as well...
* Joshua Uziel [020811 22:14]:
> * David S. Miller [020811 18:44]:
> > Hmmm, can the people who can reproduce this try this patch instead?
>
> Yeah Dave, that seems to do it... at least with my testing on a U1/170E.
> Previously, simply ping flooding it from another box would cause the
> lockup pretty quickly.
Damn, scratch that, dude... it definitely helps, but I just had it
happen to me with your patch. It's much harder to reproduce with your
patch...
From: Joshua Uziel
Date: Mon, 12 Aug 2002 00:11:41 -0700
> Damn, scratch that, dude... it definitely helps, but I just had it
> happen to me with your patch. It's much harder to reproduce with your
> patch...
Is there anything interesting in the logs when it happens?
Are there any error statistics in /proc/net/dev incremented
when it happens?
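
For anyone who wants to answer the /proc/net/dev question without
eyeballing two dumps, here is a minimal sketch in Python. It assumes the
usual 16-column layout of /proc/net/dev (8 receive counters, then 8
transmit counters); the function names parse_net_dev, error_deltas and
snapshot, and the choice of which counters to treat as plain traffic,
are my own and not anything from the sunhme driver or this thread.

```python
# Sketch: compare two snapshots of /proc/net/dev and report which
# non-traffic counters changed. Column names follow the header that
# Linux prints; the 16-column layout is an assumption.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed"]
COLUMNS = ["rx_" + f for f in RX_FIELDS] + ["tx_" + f for f in TX_FIELDS]

# Counters that grow in normal operation and say nothing about errors.
TRAFFIC = {"rx_bytes", "rx_packets", "rx_compressed", "rx_multicast",
           "tx_bytes", "tx_packets", "tx_compressed"}


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # the two header lines contain no colon
        fields = rest.split()
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue  # unexpected layout: skip the line rather than guess
        stats[name.strip()] = dict(zip(COLUMNS, map(int, fields)))
    return stats


def error_deltas(before, after, iface):
    """Return {counter_name: change} for non-traffic counters that moved."""
    old = before.get(iface, {})
    new = after.get(iface, {})
    return {key: new[key] - old.get(key, 0)
            for key in COLUMNS
            if key in new and key not in TRAFFIC
            and new[key] != old.get(key, 0)}


def snapshot(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_net_dev(handle.read())
```

To use it, take `before = snapshot()` while the network is healthy, take
`after = snapshot()` from the serial console once the watchdog messages
appear, and print `error_deltas(before, after, "eth0")`. An empty result
means none of the error, drop, fifo, frame, collision or carrier
counters changed between the two snapshots.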