2007-02-01 19:22:42

by Thomas Glanzmann

Subject: sky2 hangs

Hello,
I have a sky2 network card in my Intel Mac mini. It stops working when I
put heavy network load on it, like watching a DivX over http/sshfs. However, if I
remove the driver module and load it again, it works, and even the TCP
connection doesn't get shut down. I automated the above procedure using
a userland watchdog which basically does the same thing and is written
entirely by me, because the traditional watchdog wasn't that reliable
and produced a lot of false positives:

* Check every ten seconds whether my default router is pingable (3
pings, at least one has to come back).
If it isn't, I call the fix_network script (it calls the
script only once after a ping gets lost; before the script runs again, at least one
ping has to arrive)

(mini) [~] cat /usr/local/sbin/fix_network
#!/bin/bash

export PATH=/bin:/usr/bin:/usr/sbin:/sbin

rmmod sky2
modprobe sky2
ifdown eth0
ifup eth0

If after that no ping is received from the default
router for another 90 seconds I tell init to reboot and
stop feeding the kernel software watchdog.

* My watchdog also checks whether the sshd process is running. If it is
down for more than 100 seconds, it reboots the machine, too.

Jan 27 22:35:35 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 27 22:35:35 mini watchdog-tg[4146]: Running fix_network script.
Jan 27 22:38:46 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 27 22:38:46 mini watchdog-tg[4146]: Running fix_network script.
Jan 27 22:44:17 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 27 22:44:17 mini watchdog-tg[4146]: Running fix_network script.
Jan 29 12:00:13 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 29 12:00:13 mini watchdog-tg[4146]: Running fix_network script.
Jan 29 19:18:59 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 29 19:18:59 mini watchdog-tg[4146]: Running fix_network script.
Jan 31 15:56:29 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Jan 31 15:56:29 mini watchdog-tg[4146]: Running fix_network script.
Feb 1 08:56:57 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
Feb 1 08:56:57 mini watchdog-tg[4146]: Running fix_network script.
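
For readers who want to try something similar, the ping check described above
could be sketched roughly as follows. This is not the watchdog-tg source: the
function names, the variables and the use of shutdown for the reboot are
assumptions made for illustration, and the sshd check and the feeding of the
kernel software watchdog are left out.

```shell
#!/bin/bash
# Hypothetical sketch of the ping check described above (not watchdog-tg itself).

ROUTER=${ROUTER:-192.168.0.3}        # default router, taken from the log above
INTERVAL=${INTERVAL:-10}             # seconds between checks
REBOOT_AFTER=${REBOOT_AFTER:-90}     # seconds without a reply after the fix
FIX_SCRIPT=${FIX_SCRIPT:-/usr/local/sbin/fix_network}

fixed=0      # 1 once the fix script has run for the current outage
fixed_at=0   # time (in seconds) at which the fix script was run

# Succeeds if at least one of three pings is answered.
router_alive() { ping -c 3 -W 1 "$ROUTER" >/dev/null 2>&1; }

run_fix() { "$FIX_SCRIPT"; }
request_reboot() { shutdown -r now; }   # assumed way of telling init to reboot
log() { logger -t watchdog-sketch "$1"; }

# One check; $1 is the current time in seconds.
check_once() {
    local now=$1
    if router_alive; then
        fixed=0                          # a ping arrived, so the fix may run again
        return 0
    fi
    if [ "$fixed" -eq 0 ]; then
        log "No PONG received from $ROUTER, running fix script"
        run_fix
        fixed=1
        fixed_at=$now
    elif [ $((now - fixed_at)) -ge "$REBOOT_AFTER" ]; then
        log "Still no PONG from $ROUTER, rebooting"
        request_reboot
    fi
}

# Only loop when started explicitly with --run.
if [ "${1:-}" = "--run" ]; then
    while true; do
        check_once "$(date +%s)"
        sleep "$INTERVAL"
    done
fi
```

Keeping the state in the fixed flag is what makes the fix script run only once
per outage; it is cleared again as soon as one ping is answered.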

I have a question about this: I wonder why the Linux kernel (no longer?)
increments the use counter of an ethernet driver (I saw it on sky2 and
e1000) when the interface is up, running and configured? I can unload
the sky2 driver without doing an 'ifconfig eth0 down' beforehand. Could
someone provide me with background on this?

With that everything works. If someone is interested in my userland
watchdog, just send me an e-mail.

@Sam: I can provide you access to my hardware including root access via
the wifi driver so that you can debug this network driver lockup, if you
want to.

Thomas


2007-02-01 19:09:16

by Stephen Hemminger

Subject: Re: sky2 hangs

On Thu, 1 Feb 2007 19:55:32 +0100
Thomas Glanzmann <[email protected]> wrote:

> Hello,
> I have a sky2 network card in my intel mac mini. It stops working when I
> do heavy network load like watching a divx over http/sshfs. However if I
> remove the driver module and load it again it works and even the tcp
> connection doesn't get shutdown. I automated the above procedure using
> a userland watchdog which basically does the same thing and is written
> entirely by me, because the traditional watchdog wasn't that reliable
> and did a lot of false positives:
>
> * Look every ten seconds if my default router is pingable (3
> pings, one has to get back).
> If it isn't the case I call fix_network script (it calls the
> script only once after a ping gets lost. To run the script again at least one
> ping has to arrive again)
>
> (mini) [~] cat /usr/local/sbin/fix_network
> #!/bin/bash
>
> export PATH=/bin:/usr/bin:/usr/sbin:/sbin
>
> rmmod sky2
> modprobe sky2
> ifdown eth0
> ifup eth0
>
> If after that no ping is received from the default
> router for another 90 seconds I tell init to reboot and
> stop feeding the kernel software watchdog.
>
> * My watchdog also checks if sshd process is running. If it is
> down for more than 100 seconds it reboots the machine, too.
>
> Jan 27 22:35:35 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:35:35 mini watchdog-tg[4146]: Running fix_network script.
> Jan 27 22:38:46 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:38:46 mini watchdog-tg[4146]: Running fix_network script.
> Jan 27 22:44:17 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:44:17 mini watchdog-tg[4146]: Running fix_network script.
> Jan 29 12:00:13 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 29 12:00:13 mini watchdog-tg[4146]: Running fix_network script.
> Jan 29 19:18:59 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 29 19:18:59 mini watchdog-tg[4146]: Running fix_network script.
> Jan 31 15:56:29 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 31 15:56:29 mini watchdog-tg[4146]: Running fix_network script.
> Feb 1 08:56:57 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Feb 1 08:56:57 mini watchdog-tg[4146]: Running fix_network script.
>
> I have a question to this: I wonder why the Linux Kernel (no longer?)
> increments the use counter of an ethernet driver (I saw it on sky2 and
> e1000) when the interface is up, running and configured? I can unload
> the sky2 driver without doing a 'ifconfig eth0 down' beforehand. Could
> someone provide me with background on this fact?

It was intentional in 2.6 to allow interfaces to be hot-removed.
Remember with Internet protocols there is no hard binding (normally)
between address and device and connections should not go down
if link fails.
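
For reference, the use counter discussed here is the third field of
/proc/modules (shown as "Used by" in lsmod). A small helper for reading it
could look like the sketch below; the function name and the MODULES_FILE
override are assumptions made so the snippet can be tried without the driver
loaded.

```shell
#!/bin/bash
# Hypothetical helper: print the use count of a loaded module.
# MODULES_FILE is an assumed override; it defaults to the real /proc/modules.

module_refcount() {
    # /proc/modules fields: name size use-count dependencies state address
    awk -v m="$1" '$1 == m { print $3 }' "${MODULES_FILE:-/proc/modules}"
}

# Example: module_refcount sky2
```

With the behaviour described above, the count for sky2 would be expected to
stay at zero even while eth0 is up, which is why rmmod succeeds.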

>
> With that everything works. If someone is interested in my userland
> watchdog, just send me an E-Mail.

Hopefully, it won't be necessary for long.

2007-02-01 19:10:30

by Stephen Hemminger

Subject: Re: sky2 hangs

On Thu, 1 Feb 2007 19:55:32 +0100
Thomas Glanzmann <[email protected]> wrote:

> Hello,
> I have a sky2 network card in my intel mac mini. It stops working when I
> do heavy network load like watching a divx over http/sshfs.

Is this heavy Tx load (i.e. you're watching a movie served from the mac mini) or
Rx load (you are watching the movie on the mac mini)?

2007-02-01 19:16:52

by Thomas Glanzmann

Subject: Re: sky2 hangs

Hello Sam,

> Is this heavy Tx load (i.e. you're watching a movie served from the mac mini) or
> Rx load (you are watching the movie on the mac mini)?

it's inbound (Rx) traffic: watching a movie, a git pull from Linus, or an scp
of a kernel tar tree from my laptop to my mac mini.

Thomas

2007-02-01 19:19:05

by Thomas Glanzmann

Subject: Re: sky2 hangs

Hello Stephen,

> It was intentional in 2.6 to allow interfaces to be hot-removed.
> Remember with Internet protocols there is no hard binding (normally)
> between address and device and connections should not go down if link
> fails.

of course. That makes sense. I just wondered when the change of mind
happened. And actually I like this behaviour.

> > With that everything works. If someone is interested in my userland
> > watchdog, just send me an E-Mail.

> Hopefully, it won't be necessary for long.

I hope so, too.

Thomas

2007-02-01 23:01:15

by Stephen Hemminger

Subject: Re: sky2 hangs

I can reproduce the problem now (on mac mini). Interestingly it seems to whack
the whole ethernet switch when it happens.
>
> - a previously suggested fix - passing idle=poll to the kernel - did not
> work for me at the end

It is not an MSI or IRQ problem. It is a phy problem (see below).

> - the locks I have happen very periodically (somewhere around every 22-28
> hours), as if the chip would die after a given amount of data transferred;
> I know this looks stupid but I thought I might mention it
> - I have about 1Mbit/s of (incoming) traffic on this interface: with
> short, very high peaks, as there is a MySQL server on the other end,
> receiving about 100 queries per second
> - unloading the sky2 module totally freezes the computer for me

If you do:
ethtool -r eth0
it causes a PHY reset (renegotiation) and clears the problem.


--
Stephen Hemminger <[email protected]>

2007-02-01 23:15:10

by Fagyal Csongor

Subject: Re: sky2 hangs

> Hello,
> I have a sky2 network card in my intel mac mini. It stops working when I
> do heavy network load like watching a divx over http/sshfs. However if I
> remove the driver module and load it again it works and even the tcp
> connection doesn't get shutdown. I automated the above procedure using a
> userland watchdog which basically does the same thing and is written
> entirely by me, because the traditional watchdog wasn't that reliable
> and did a lot of false positives:
>
> * Look every ten seconds if my default router is pingable (3
> pings, one has to get back).
> If it isn't the case I call fix_network script (it calls
> the script only once after a ping gets lost. To run the
> script again at least one ping has to arrive again)
>
> (mini) [~] cat /usr/local/sbin/fix_network
> #!/bin/bash
>
> export PATH=/bin:/usr/bin:/usr/sbin:/sbin
>
> rmmod sky2
> modprobe sky2
> ifdown eth0
> ifup eth0
>
> If after that no ping is received from the default
> router for another 90 seconds I tell init to reboot and
> stop feeding the kernel software watchdog.
>
> * My watchdog also checks if sshd process is running. If it is
> down for more than 100 seconds it reboots the machine, too.
>
> Jan 27 22:35:35 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:35:35 mini watchdog-tg[4146]: Running fix_network script.
> Jan 27 22:38:46 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:38:46 mini watchdog-tg[4146]: Running fix_network script.
> Jan 27 22:44:17 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 27 22:44:17 mini watchdog-tg[4146]: Running fix_network script.
> Jan 29 12:00:13 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 29 12:00:13 mini watchdog-tg[4146]: Running fix_network script.
> Jan 29 19:18:59 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 29 19:18:59 mini watchdog-tg[4146]: Running fix_network script.
> Jan 31 15:56:29 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Jan 31 15:56:29 mini watchdog-tg[4146]: Running fix_network script.
> Feb 1 08:56:57 mini watchdog-tg[4146]: No PONG received from 192.168.0.3 (failure 1 of 10)
> Feb 1 08:56:57 mini watchdog-tg[4146]: Running fix_network script.
[...]

I would like to add a few things:

- a previously suggested fix - passing idle=poll to the kernel - did not
work for me in the end
- the locks I have happen very periodically (somewhere around every 22-28
hours), as if the chip would die after a given amount of data transferred;
I know this looks stupid but I thought I might mention it
- I have about 1Mbit/s of (incoming) traffic on this interface: with
short, very high peaks, as there is a MySQL server on the other end,
receiving about 100 queries per second
- unloading the sky2 module totally freezes the computer for me



- Fagzal


2007-02-02 06:27:25

by Thomas Glanzmann

Subject: Re: sky2 hangs

Hello Fagyal,

> - a previously suggested fix - passing idle=poll to the kernel - did not
> work for me at the end

same for me. I tried the two module parameters and the kernel parameter:

pci=nomsi sky2.disable_msi=1 sky2.idle_timeout=1000

> - the locks I have happen very periodically (somewhere around every 22-28
> hours), as if the chip would die after a given amount of data transferred;
> I know this looks stupid but I thought I might mention it

I had a dedicated server with sky2 which had the same symptoms, but I
disabled the onboard sky2 and added an e100. On my mac mini I can
reproduce it nearly immediately. I just have to scp a kernel tar tree over
and it hangs.

> - unloading the sky2 module totally freezes the computer for me

For me it works pretty well.

Thomas

2007-02-02 06:31:08

by Thomas Glanzmann

Subject: Re: sky2 hangs

Hello Stephen,

> I can reproduce the problem now (on mac mini). Interestingly it seems
> to whack the whole ethernet switch when it happens.

wow. I have a Linksys wrt54g as 'ethernet switch' and my Snom 320 VoIP
phone still works when the mini network card goes down. On the other
hand, the wrt54g isn't exactly a switch but more like a bunch of network
cards which use the Linux bridging code, IIRC.

> If you do:
> ethtool -r eth0
> it causes a PHY reset (renegotiation) and clears the problem.

But this isn't related to my problem (on mac mini), is it?

Thomas

2007-02-02 10:49:20

by Thomas Glanzmann

Subject: Re: sky2 hangs

Hello,

> Next time sky2 hangs on me I'll try to reset the PHY and see if that
> helps. I can usually trigger the hang by doing a couple of ifconfig
> up/down on the interface, though I'm not getting any error message
> from the driver when that happens.

same for me. There is absolutely nothing in dmesg. I changed my fix script, too,
to see if that is enough to resolve the problem.
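
A changed fix script along these lines might look like the sketch below: it
tries the PHY reset first and falls back to the module reload only if the
router still does not answer. Only the individual commands come from this
thread; the split into functions, the IFACE and ROUTER variables and the
SETTLE delay are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical variant of fix_network: PHY reset first, module reload as fallback.

export PATH=/bin:/usr/bin:/usr/sbin:/sbin

IFACE=${IFACE:-eth0}
ROUTER=${ROUTER:-192.168.0.3}    # default router, as seen in the watchdog log
SETTLE=${SETTLE:-5}              # assumed time (seconds) for the link to renegotiate

# Succeeds if at least one of three pings is answered.
router_alive() { ping -c 3 -W 1 "$ROUTER" >/dev/null 2>&1; }

# Restart autonegotiation, which resets the PHY.
phy_reset() { ethtool -r "$IFACE"; }

# The steps of the original fix_network script.
reload_driver() {
    rmmod sky2
    modprobe sky2
    ifdown "$IFACE"
    ifup "$IFACE"
}

fix_network() {
    phy_reset
    sleep "$SETTLE"
    if router_alive; then
        return 0                 # the PHY reset was enough
    fi
    reload_driver                # otherwise fall back to reloading the module
}

# Only act when started explicitly with --run.
if [ "${1:-}" = "--run" ]; then
    fix_network
fi
```

Logging which of the two steps brought the link back would show whether the
PHY reset alone is sufficient on a given machine.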

Thomas

2007-02-02 11:07:37

by Julien BLACHE

Subject: Re: sky2 hangs

Thomas Glanzmann <[email protected]> wrote:

Hi,

>> I can reproduce the problem now (on mac mini). Interestingly it seems
>> to whack the whole ethernet switch when it happens.

I've observed that too, on a cheap DLink switch.

Next time sky2 hangs on me I'll try to reset the PHY and see if that
helps. I can usually trigger the hang by doing a couple of ifconfig
up/down on the interface, though I'm not getting any error message
from the driver when that happens.

JB.

--
Julien BLACHE <http://www.jblache.org>
<[email protected]> GPG KeyID 0xF5D65169

2007-02-02 11:55:26

by Fagyal Csongor

Subject: Re: sky2 hangs

Thomas Glanzmann wrote:

>Hello,
>
>
>
>>Next time sky2 hangs on me I'll try to reset the PHY and see if that
>>helps. I can usually trigger the hang by doing a couple of ifconfig
>>up/down on the interface, though I'm not getting any error message
>>from the driver when that happens.
>>
>>
>
>same for me. In dmesg is absolutely nothing. I change my fix script, too.
>To see if that is enough to resolve the problem.
>
>
Well, ethtool -r eth0 did not work for me. :(

This time I got nothing in the log.

When I run ethtool -r eth0, I get this:
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both

But the interface stays down. (Maybe the other end got confused?)


- Cs.

2007-02-02 13:40:30

by Jarek Poplawski

Subject: Re: sky2 hangs

On 02-02-2007 12:53, Fagyal Csongor wrote:
> Thomas Glanzmann wrote:
...
>>> Next time sky2 hangs on me I'll try to reset the PHY and see if that
>>> helps. I can usually trigger the hang by doing a couple of ifconfig
>>> up/down on the interface, though I'm not getting any error message
>>> from the driver when that happens.
>>>
>>
>> same for me. In dmesg is absolutely nothing. I change my fix script, too.
>> To see if that is enough to resolve the problem.
>>
>>
> Well, ethtool -r eth0 did not work for me. :(
>
> This time I got nothing in the log.
>
> When I say ethtool -r eth0, I have this:
> sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
>
> But the interface stays down. (Maybe the other end got confused?)

Hi,

Is this with yesterday's sky2-tx-recover.patch applied?

Jarek P.

2007-02-02 14:10:41

by Jarek Poplawski

Subject: Re: sky2 hangs

On Fri, Feb 02, 2007 at 02:43:11PM +0100, Jarek Poplawski wrote:
> On 02-02-2007 12:53, Fagyal Csongor wrote:
> > Thomas Glanzmann wrote:
...
> Is this with this yesterday sky2-tx-recover.patch applied?

I mean the hangs, not ethtool.

Regards,
Jarek P.