2007-09-05 09:22:22

by Stephen Hemminger

Subject: Re: sk98lin for 2.6.23-rc1

On Sun, 29 Jul 2007 21:01:30 -0600
Rob Sims <[email protected]> wrote:

> On Thu, Jul 26, 2007 at 06:57:01PM +0200, Adrian Bunk wrote:
> > On Thu, Jul 26, 2007 at 11:16:36AM -0400, Kyle Rose wrote:
> > > >From http://www.krose.org/~krose/computing.html:
> > >
> > > Since the sky2 driver continues to suck ass (which is a technical
> > > description for "it hangs all the time under load, at least on my
> > > hardware" :-) ), I've fixed the sk98lin driver to compile for
> > > linux-2.6.23-rc1. Those who continue to have problems with sky2 can
> > > still use 2.6.23-rc1, simply by doing the following:
> > >...
> > > Personally, I'd like to see sk98lin remain in the kernel proper until
> > > sky2 goes at least 6 months without reported problems. The fact that I
> > > am not the only one still seeing issues is a clear indication that sky2
> > > (even with the recent patches in 2.6.23-rc1) is not yet ready to replace
> > > sk98lin.
> > >...
> >
> > This sounds good in theory.
> >
> > The practical problem with this approach is that there are always many
> > people who use the old driver when the new driver doesn't work for them
> > instead of reporting their problems with the new driver.
> >
> > For these people a new driver will often suck when the old driver gets
> > removed, but after the removal of the old driver they are finally forced
> > to report their bugs resulting in a better new driver for everyone.
> >
> > The sky2 driver is since nearly 2 years in the kernel and Stephen is
> > usually quite good at handling bugs.
>
> The driver still (2.6.20/sky2 1.13) hangs for me (more rarely than in
> the past), and cycling the module generally fixes the issues. I have
> supplied all the information that Stephen has asked for, but still no
> resolution. I am not complaining about the lack of a fix, but don't
> assume that all it takes to get sky2 working is adequate bug reports. I
> have been and remain willing to test and assist debug, but after several
> dropped threads, I feel like the desire or ability to fix this issue
> isn't there (and remote debug of an intermittent hardware issue IS
> hard), and I didn't want to be a nuisance to someone that has no
> obligation to me to address the issue in the first place.
>
> Stability has improved, it's just not there yet.
>
> I'll switch to 1.16 soon, and respond to Stephen's request on netdev for
> current issues.
> --
> Rob

The only known outstanding problems on 2.6.22.6 of sky2 are:
* problems with fibre PHY based systems
* suspend/resume issues, missing multicast reinitialization, etc.
The previous stability problems have been addressed.


2007-09-05 19:42:18

by James Corey

Subject: Re: sk98lin for 2.6.23-rc1


--- Stephen Hemminger
<[email protected]> wrote:

> On Sun, 29 Jul 2007 21:01:30 -0600
> Rob Sims <[email protected]> wrote:
>
> > On Thu, Jul 26, 2007 at 06:57:01PM +0200, Adrian Bunk wrote:
> > > On Thu, Jul 26, 2007 at 11:16:36AM -0400, Kyle Rose wrote:
> > > > >From http://www.krose.org/~krose/computing.html:
> > > >
> > > > Since the sky2 driver continues to suck ass (which is a technical
> > > > description for "it hangs all the time under load, at least on my
> > > > hardware" :-) ), I've fixed the sk98lin driver to compile for
> > > > linux-2.6.23-rc1. Those who continue to have problems with sky2 can
> > > > still use 2.6.23-rc1, simply by doing the following:
> > > >...
> > > > Personally, I'd like to see sk98lin remain in the kernel proper until
> > > > sky2 goes at least 6 months without reported problems. The fact that I
> > > > am not the only one still seeing issues is a clear indication that sky2
> > > > (even with the recent patches in 2.6.23-rc1) is not yet ready to replace
> > > > sk98lin.
> > > >...
> > >
> > > This sounds good in theory.
> > >
> > > The practical problem with this approach is that there are always many
> > > people who use the old driver when the new driver doesn't work for them
> > > instead of reporting their problems with the new driver.
> > >
> > > For these people a new driver will often suck when the old driver gets
> > > removed, but after the removal of the old driver they are finally forced
> > > to report their bugs resulting in a better new driver for everyone.
> > >
> > > The sky2 driver is since nearly 2 years in the kernel and Stephen is
> > > usually quite good at handling bugs.
> >
> > The driver still (2.6.20/sky2 1.13) hangs for me (more rarely than in
> > the past), and cycling the module generally fixes the issues. I have
> > supplied all the information that Stephen has asked for, but still no
> > resolution. I am not complaining about the lack of a fix, but don't
> > assume that all it takes to get sky2 working is adequate bug reports. I
> > have been and remain willing to test and assist debug, but after several
> > dropped threads, I feel like the desire or ability to fix this issue
> > isn't there (and remote debug of an intermittent hardware issue IS
> > hard), and I didn't want to be a nuisance to someone that has no
> > obligation to me to address the issue in the first place.
> >
> > Stability has improved, it's just not there yet.
> >
> > I'll switch to 1.16 soon, and respond to Stephen's request on netdev for
> > current issues.
> > --
> > Rob
>
> The only known outstanding problems on 2.6.22.6 of sky2 are:
> * problems with fibre PHY based systems
> * suspend/resume issues, missing multicast reinitialization, etc.
> The previous stability problems have been addressed.

I pretty much agree with everything said, including
the part about the sky2 people working hard on it. I
have noticed several bugs fixed recently in the driver
source.

However, it really DOES lock up under load. I even
tried 2.6.23-rc4 and the absolute latest version of the
driver and it still locks up, as in

eth1: hw csum failure.

Call Trace:
<IRQ> [<ffffffff804779b6>] __skb_checksum_complete_head+0x43/0x56
[<ffffffff804779d5>] __skb_checksum_complete+0xc/0x11
[<ffffffff804a989d>] tcp_v4_rcv+0x14e/0x801
[<ffffffff8048ff84>] ip_local_deliver+0xca/0x14c
[<ffffffff80490472>] ip_rcv+0x46c/0x4ae
[<ffffffff88006138>] :sky2:sky2_poll+0x72b/0x9c7
[<ffffffff80245979>] update_wall_time+0x28c/0x39b
[<ffffffff8047c934>] net_rx_action+0xa8/0x166
[<ffffffff8023901c>] do_timer+0x10/0xab
[<ffffffff80235ced>] __do_softirq+0x55/0xc4
[<ffffffff8020c5cc>] call_softirq+0x1c/0x28
[<ffffffff8020d6fd>] do_softirq+0x2c/0x7d
[<ffffffff8020d9bb>] do_IRQ+0x13e/0x15f
[<ffffffff8020a780>] mwait_idle+0x0/0x48
[<ffffffff8020b951>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff804acdb9>] udp_poll+0x0/0xfb
[<ffffffff8020a7c2>] mwait_idle+0x42/0x48
[<ffffffff8020a718>] cpu_idle+0xbd/0xe0
[<ffffffff80704a5a>] start_kernel+0x2ac/0x2b8
[<ffffffff80704140>] _sinittext+0x140/0x144

As far as I can tell, this bug has been with the
sky2 driver all the way back to the Beforetime,
based on it happening with various versions of the
driver back to 2.6.18 that I have tried, plus some
googling on it.

So while the bug reporting point is a good one, it would
be nice to have a reliable driver in the kernel until
the sky2 one is better. The alternative is to use
the vendor driver, which is less than optimal.

-J




2007-09-05 21:33:44

by Kyle Rose

Subject: Re: sk98lin for 2.6.23-rc1


> However, it really DOES lock up under load. I even
> tried 2.6.23-rc4 and the absolute latest version of
> the
> driver and it still locks up, as in
>
Yich. I'm glad I'm still using sk98lin on my unmanned colo box.

Kyle

2007-09-05 23:02:32

by Stephen Hemminger

Subject: Re: sk98lin for 2.6.23-rc1

On Wed, 05 Sep 2007 17:04:59 -0400
Kyle Rose <[email protected]> wrote:

>
> > However, it really DOES lock up under load. I even
> > tried 2.6.23-rc4 and the absolute latest version of
> > the
> > driver and it still locks up, as in
> >
> Yich. I'm glad I'm still using sk98lin on my unmanned colo box.
>
> Kyle
>

Great for you. When I was testing, sk98lin crashed my machine on an
overnight stress run. My intuition is that there is a bug in sk98lin
on Yukon EC-U chips (those without ram buffer) and a hardware
problem on Yukon XL chips (those with ram buffer), and the sky2
driver doesn't have a workaround for the ram buffer getting stuck (yet).

I don't like putting workarounds in for problems I can't reproduce.
After KS, I'll rerun more stress tests on all the chip flavors
and see if the hang is reproducible.

2007-09-08 17:43:39

by Bill Davidsen

Subject: Re: sk98lin for 2.6.23-rc1

James Corey wrote:
> --- Stephen Hemminger
> <[email protected]> wrote:
>
>> On Sun, 29 Jul 2007 21:01:30 -0600
>> Rob Sims <[email protected]> wrote:
>>
>>> On Thu, Jul 26, 2007 at 06:57:01PM +0200, Adrian
>> Bunk wrote:

>> The only known outstanding problems on 2.6.22.6 of sky2 are:
>> * problems with fibre PHY based systems
>> * suspend/resume issues, missing multicast reinitialization, etc.
>> The previous stability problems have been addressed.
>
> I pretty much agree with everything said, including
> the part about the sky2 people working hard on it. I
> have noticed several bugs fixed recently in the driver
> source.
>
> However, it really DOES lock up under load. I even
> tried 2.6.23-rc4 and the absolute latest version of
> the
> driver and it still locks up, as in
>
> eth1: hw csum failure.
>
I changed from the sk98lin to the previous driver Adrian said was the
"right one," skge IIRC. Then he started pushing sky2, and I tried that.
Like you I get hangs, but unlike you the system doesn't hang, just the
NIC. No errors, warnings, and reboot fixes it. Acts as if the cable were
pulled.

That was with 2.6.22.5 (or so), dropped back to an old kernel with
sk98lin, previously had uptimes in three digit days. Up for a week or so
now.

Haven't tried later kernels, don't intend to, while no network is really
secure, it's not really useful.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-09-08 19:11:35

by Adrian Bunk

Subject: Re: sk98lin for 2.6.23-rc1

On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
>...
> That was with 2.6.22.5 (or so), dropped back to an old kernel with sk98lin,
> previously had uptimes in three digit days. Up for a week or so now.

There is a real long-term advantage of removing drivers like sk98lin
because it forces people to report bugs if the new driver doesn't work
instead of giving them the workaround of using the obsolete driver.
And this has the (at first sight surprising) effect that removing code
results in an improvement of the kernel.

> Haven't tried later kernels, don't intend to, while no network is really
> secure, it's not really useful.

You are a regular reader of linux-kernel, and therefore the sk98lin
removal can hardly be a surprise for you. If you prefer whining over
helping to improve the kernel that's your choice...

> Bill Davidsen

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-09-09 02:42:30

by Kyle Rose

Subject: Re: sk98lin for 2.6.23-rc1


> You are a regular reader of linux-kernel, and therefore the sk98lin
> removal can hardly be a surprise for you. If you prefer whining over
> helping to improve the kernel that's your choice...
>
In my case the issue is simply one of practicality: I cannot go to the
data center 5 times per day to reboot my colo box. Therefore, I run
sk98lin. It's really that simple.

Kyle

2007-09-09 04:49:10

by Willy Tarreau

Subject: Re: sk98lin for 2.6.23-rc1

On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
>
> > You are a regular reader of linux-kernel, and therefore the sk98lin
> > removal can hardly be a surprise for you. If you prefer whining over
> > helping to improve the kernel that's your choice...
> >
> In my case the issue is simply one of practicality: I cannot go to the
> data center 5 times per day to reboot my colo box. Therefore, I run
> sk98lin. It's really that simple.

Adrian generally wants to force "normal" users to test new drivers in order
to quickly find bugs and fade out older ones. While this is often possible
on the desktop, it's not possible for production servers. And not everyone
can run 2.6.16.x to get a long-term stable kernel.

I think that what is really needed is to add the opposite of "experimental"
in the config options. Something like "deprecated drivers" which would be
disabled by default. Desktop users would normally not care about that and
rely only on newer drivers. Server users would have to enable the option if
they want their old driver to be present because they have no other choice.

With each driver's help text, it would be wise to add some text indicating
what will replace the driver in question, so that their users know how to
test it on non-production machines.

But I agree with Kyle that on production systems, it is not acceptable to
have a driver hang even once a month. This generally implies loss of service
and customers going away. Ideology has no place in this area, it is quickly
replaced by pragmatism.

It was the same reason I spent time trying to get sky2 to reliably work in
2.4 ; sk98lin v8 was horribly unstable. Sky2 was fairly better but did not
support some basic operations such as ifdown/ifup. sk98lin v10 finally worked
fine, and I upgraded my customer's system with it because I needed anything
which would reliably work. It was not acceptable anymore to have the customer
phone twice a week complaining that their server had crashed again.

In the long term, I would really like to get sky2 to work well in 2.4
because I'm more confident in it, it's cleaner, less obscure and less
bloated. Having passed terabytes of data through both drivers I have
not observed any glitch with sky2 as I had with sk98lin v8.

Fortunately, sky2 chips are mostly found on desktop motherboards, so that
helps the driver stabilize very quickly. It should not take as long as
the transition from eepro100 to e100.

Willy

2007-09-09 11:13:28

by Adrian Bunk

Subject: Re: sk98lin for 2.6.23-rc1

On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
>
> > You are a regular reader of linux-kernel, and therefore the sk98lin
> > removal can hardly be a surprise for you. If you prefer whining over
> > helping to improve the kernel that's your choice...
> >
> In my case the issue is simply one of practicality: I cannot go to the
> data center 5 times per day to reboot my colo box. Therefore, I run
> sk98lin. It's really that simple.

When did you report this bug the first time?

What we need is that people when testing a new kernel they plan to use
test the new drivers *and report the bugs if they run into any*.

What could we have done so that you reported your bug without removing
the sk98lin driver?

> Kyle

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-09-09 15:55:07

by Chris Stromsoe

Subject: Re: sk98lin for 2.6.23-rc1

On Sat, 8 Sep 2007, Adrian Bunk wrote:
> On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
>
>> Haven't tried later kernels, don't intend to, while no network is
>> really secure, it's not really useful.
>
> You are a regular reader of linux-kernel, and therefore the sk98lin
> removal can hardly be a surprise for you. If you prefer whining over
> helping to improve the kernel that's your choice...

I've been trying to migrate off sk98lin to skge since earlier this year,
without success, starting with 2.6.18 or .19.

I have several of these cards in production using the sk98lin driver:

fresno:~# lspci -vv -s 02:01
02:01.0 Ethernet controller: SysKonnect SK-9872 Gigabit Ethernet Server Adapter (SK-NET GE-ZX dual link) (rev 11)
Subsystem: SysKonnect SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 22
Region 0: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
Region 1: I/O ports at e800 [size=256]
Expansion ROM at febc0000 [disabled] [size=128K]
Capabilities: [48] Power Management version 1
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data

They are dual port SX fiber. Both ports are connected. If I do this:

fresno:~# modprobe skge
fresno:~# ip li set eth2 up
fresno:~# ip li set eth2 down
fresno:~# ip li set eth3 up

the system locks up and I have to power cycle it. The order doesn't
matter (if I do eth3 up/down, then eth2 up kills it).

I don't have any problems with sk98lin. This works fine:

fresno:~# modprobe sk98lin RlmtMode=DualNet
fresno:~# ip li set eth2 up
fresno:~# ip li set eth2 down
fresno:~# ip li set eth3 up
fresno:~# ip li set eth3 down
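
For anyone who wants to replay the two sequences above side by side, a minimal sketch follows. It is not from the original report: the function names and the DRY_RUN switch are made up for illustration, and the interface names and module option are simply the ones quoted above. By default it only prints the commands, since on the affected hardware the real sequence reportedly locks the machine hard.

```shell
#!/bin/sh
# Sketch only: replay the reported up/down sequence for a given driver.
# DRY_RUN defaults to 1 so commands are printed instead of executed;
# set DRY_RUN=0 (as root, on a test machine) to actually run them.
DRY_RUN=${DRY_RUN:-1}

# run <command...>: print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cycle_ports "<modprobe arguments>" <iface>...
# Loads the driver, then brings each interface up and down in turn.
cycle_ports() {
    modargs=$1
    shift
    # $modargs is intentionally unquoted so module options split into words
    run modprobe $modargs
    for ifc in "$@"; do
        run ip link set "$ifc" up
        run ip link set "$ifc" down
    done
}

cycle_ports "skge" eth2 eth3
cycle_ports "sk98lin RlmtMode=DualNet" eth2 eth3
```

With skge the report says the hang comes at the second interface's "up", so a real run would be expected to stop there; the dry run is only useful for checking the command order before trying it on hardware.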


I am more than happy to test various driver changes, and have tried a few
suggested patches but nothing has worked so far. I would like to be using
skge instead of sk98lin, but so far haven't had any success.




-Chris

2007-09-10 14:31:03

by Bill Davidsen

Subject: Re: sk98lin for 2.6.23-rc1

Adrian Bunk wrote:
> On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
>
>> ...
>> That was with 2.6.22.5 (or so), dropped back to an old kernel with sk98lin,
>> previously had uptimes in three digit days. Up for a week or so now.
>>
>
> There is a real long-term advantage of removing drivers like sk98lin
> because it forces people to report bugs if the new driver doesn't work
> instead of giving them the workaround of using the obsolete driver.

The issue is that sk98lin is only obsolete because you say so! skge
crashes the system, as Chris reports, sky2 just stops passing bits and
behaves as if the network cable were idle, no error messages of any
nature, ping claims it's sending packets, tcpdump claims packets are
being sent, the switch never blinks and systems on the switch see no
packets. Again, no error messages, no dumps, nothing which would help
you debug it, and it happens after some undefined time.
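
Since a hang like this leaves nothing in the logs, one way to at least timestamp it is to watch the interface counters from outside the driver. The following is a rough sketch under stated assumptions, not something anyone in this thread ran: it assumes the sysfs counters under /sys/class/net/<iface>/statistics/ are available, that the link normally receives some traffic in every sampling interval, and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: flag an interface whose receive counter has stopped moving.
# A frozen counter on a quiet link is a false positive, so pick an interval
# long enough that inbound traffic is normally guaranteed.

# rx_frozen <rx_before> <rx_after>: succeeds when no packets arrived in between
rx_frozen() {
    [ "$2" -eq "$1" ]
}

# watch_iface <iface> <seconds>: sample rx_packets twice, log a timestamp and
# return 1 if it did not advance; return 2 if the counter cannot be read
watch_iface() {
    stats="/sys/class/net/$1/statistics/rx_packets"
    before=$(cat "$stats") || return 2
    sleep "$2"
    after=$(cat "$stats") || return 2
    if rx_frozen "$before" "$after"; then
        echo "$(date) $1: rx_packets stuck at $after"
        return 1
    fi
    return 0
}
```

Run from cron or a loop, for example `watch_iface eth1 60`, this would only narrow down when the NIC went quiet; it says nothing about why.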

skge and sky2 are up to eight or ten versions now, and they still don't
work. Just because a driver works doesn't mean it's obsolete.
>
> And this has the (at first sight surprising) effect that removing code
> results in an improvement of the kernel.
>
>
>> Haven't tried later kernels, don't intend to, while no network is really
>> secure, it's not really useful.
>>
>
> You are a regular reader of linux-kernel, and therefore the sk98lin
> removal can hardly be a surprise for you. If you prefer whining over
> helping to improve the kernel that's your choice...
>

I am trying to "improve the kernel" by advocating not removing reliable
drivers in favor of unreliable drivers. Saying a driver is better
because it has a clean design and good code is something I would expect
from someone who hadn't written or used code. If skge and sky2 were so
clean you wouldn't still be chasing obscure bugs after the driver had
been in the kernel for six+ versions, you wouldn't have me wasting time
trying to get a more secure kernel which is still reliable, wouldn't
have Willy Tarreau suggesting you should be marking sk98lin as obsolete
and leaving it in, wouldn't have someone maintaining sk98lin as a patch,
wouldn't have Chris Stromsoe getting hard lock-ups. No matter how ugly
sk98lin looks, and how well designed skge and sky2 may be, reliability
is not a beauty contest.

The volume of complaint should give you a hint that in this case the new
drivers aren't usefully stable for many people, and that you are
advocating a removal which is at least premature. If you can't admit
you're wrong on this one, you can say you have reconsidered the timing
of removal in light of new information.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-09-10 15:39:54

by Adrian Bunk

Subject: Re: sk98lin for 2.6.23-rc1

On Mon, Sep 10, 2007 at 10:32:45AM -0400, Bill Davidsen wrote:
> Adrian Bunk wrote:
>> On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
>>
>>> ...
>>> That was with 2.6.22.5 (or so), dropped back to an old kernel with
>>> sk98lin, previously had uptimes in three digit days. Up for a week or so
>>> now.
>>>
>>
>> There is a real long-term advantage of removing drivers like sk98lin
>> because it forces people to report bugs if the new driver doesn't work
>> instead of giving them the workaround of using the obsolete driver.
>
> The issue is that sk98lin is only obsolete because you say so!

No, it is obsolete because we have more than one driver for this
hardware, and the people responsible for network drivers in the kernel
decided some time ago that sk98lin is the one that is obsolete.

>...
>> And this has the (at first sight surprising) effect that removing code
>> results in an improvement of the kernel.
>>
>>
>>> Haven't tried later kernels, don't intend to, while no network is really
>>> secure, it's not really useful.
>>>
>>
>> You are a regular reader of linux-kernel, and therefore the sk98lin
>> removal can hardly be a surprise for you. If you prefer whining over
>> helping to improve the kernel that's your choice...
>>
>
> I am trying to "improve the kernel" by advocating not removing reliable
> drivers in favor of unreliable drivers. Saying a driver is better because
> it has a clean design and good code is something I would expect from
> someone who hadn't written or used code. If skge and sky2 were so clean you
> wouldn't still be chasing obscure bugs after the driver had been in the
> kernel for six+ versions, you wouldn't have me wasting time trying to get a
> more secure kernel which is still reliable, wouldn't have Willy Tarreau
> suggesting you should be marking sk98lin as obsolete and leaving it in,
> wouldn't have someone maintaining sk98lin as a patch, wouldn't have Chris
> Stromsoe getting hard lock-ups. No matter how ugly sk98lin looks, and how
> well designed skge and sky2 may be, reliability is not a beauty contest.

A better written driver might still lack some workarounds for broken
hardware or similar problems. Or simply contain some bugs like all
software does.

The important word is not "reliability", it's "maintainability".
And that's something that pays off in the long term.

> The volume of complaint should give you a hint that in this case the new
> drivers aren't usefully stable for many people, and that you are advocating
> a removal which is at least premature. If you can't admit you're wrong on
> this one, you can say you have reconsidered the timing of removal in light
> of new information.

It was clear that sk98lin would go in the long term, and the only thing
that could be discussed is the when and how of removal.

When you talk about "new information", why did this information not
surface until after the sk98lin driver was removed?

Is there really a problem with "the timing of removal" or would we have
faced exactly the same problems if the removal was timed a year later?

And this is really the essence when I'm saying "removing code improves
the kernel": The goal is to get people to report if the new drivers
aren't usefully stable for them, not to use sk98lin instead without
sending a bug report.

Having different drivers with different sets of bugs and features is
not a situation that should be retained for a longer time.

The underlying question is:
Is there anything better than a quick removal of the obsolete driver to
get people to both test and report bugs with the new driver?
Keeping obsolete drivers longer only for running into exactly the same
problem later isn't an improvement.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-09-11 04:24:15

by Kyle Moffett

Subject: Re: sk98lin for 2.6.23-rc1

On Sep 10, 2007, at 11:39:53, Adrian Bunk wrote:
> No, it is obsolete because we have more than one driver for this
> hardware, and the people responsible for network drivers in the
> kernel decided some time ago that sk98lin is the one that is obsolete.

I would like to happily report that the sky2 driver works great in
the NIC on my tablet where the sk98lin and skge drivers both fail
utterly and hang the kernel. On another system the sk98lin and skge
drivers don't recognize the chipset at all (missing PCI ID?) while
the sky2 driver works perfectly for large quantities of data
transferred.

Cheers,
Kyle Moffett

2007-09-11 08:05:31

by Stephen Hemminger

Subject: Re: sk98lin for 2.6.23-rc1

On Sun, 9 Sep 2007 13:13:26 +0200
Adrian Bunk <[email protected]> wrote:

> On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
> >
> > > You are a regular reader of linux-kernel, and therefore the sk98lin
> > > removal can hardly be a surprise for you. If you prefer whining over
> > > helping to improve the kernel that's your choice...
> > >
> > In my case the issue is simply one of practicality: I cannot go to the
> > data center 5 times per day to reboot my colo box. Therefore, I run
> > sk98lin. It's really that simple.
>
> When did you report this bug the first time?
>
> What we need is that people when testing a new kernel they plan to use
> test the new drivers *and report the bugs if they run into any*.
>
> What could we have done so that you reported your bug without removing
> the sk98lin driver?
>
> > Kyle
>
> cu
> Adrian


There are several different problems in this thread:
1. The removal of old sk98lin driver caused some users to be forced to use
skge. These users have uncovered issues with the dual port fiber based versions
of the board.
Short term: The sk98lin driver should be restored to previous state,
and the PCI table should be used to limit the usage to only fiber systems.
If Adrian doesn't do it, I'll do it when I return from Germany.
Long term: I have fiber based board (thanks ebay) on the way to resolve
skge bug.

2. Sky2 driver has its own fiber based problems. Solve these after skge fiber.

3. Sky2 doesn't have as many workarounds for hardware problems as vendor sk98lin
driver.

2007-09-11 11:55:09

by Adrian Bunk

Subject: Re: sk98lin for 2.6.23-rc1

On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>
> There are several different problems in this thread:
> 1. The removal of old sk98lin driver caused some users to be forced to use
> skge. These users have uncovered issues with the dual port fiber based versions
> of the board.
> Short term: The sk98lin driver should be restored to previous state,
> and the PCI table should be used to limit the usage to only fiber systems.
> If Adrian doesn't do it, I'll do it when I return from Germany.
>...

No problem with this, but since it was Jeff's patch it should better be
him who reverts it (and he's anyway one step nearer to Linus).

But the underlying general problem still remains:

How can we get people to test and report bugs with the new drivers
before removing the old driver?

That's a question especially for the people who now had problems after
sk98lin was removed.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-09-11 14:28:14

by Bill Davidsen

Subject: Re: sk98lin for 2.6.23-rc1

Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>
>> There are several different problems in this thread:
>> 1. The removal of old sk98lin driver caused some users to be forced to use
>> skge. These users have uncovered issues with the dual port fiber based versions
>> of the board.
>> Short term: The sk98lin driver should be restored to previous state,
>> and the PCI table should be used to limit the usage to only fiber systems.
>> If Adrian doesn't do it, I'll do it when I return from Germany.
>> ...
>>
>
> No problem with this, but since it was Jeff's patch it should better be
> him who reverts it (and he's anyway one step nearer to Linus).
>
> But the underlying general problem still remains:
>
> How can we get people to test and report bugs with the new drivers
> before removing the old driver?
>
>
Sorry for a long answer, I'm trying to provide insight on two recent cases.

Thinking back to several drivers, when e100 was new I tried it because I
had problems with eepro100 in the area of multiple cards, multiple
cables on a single card, and jumbo packets. For a while I used both,
until e100 worked where I need it. So I initially tried it because it
had features I needed, and then dropped to older driver just to avoid
having to decide.

With sk98lin, the driver worked flawlessly with all (3-4) systems, so I
had no reason to try any other. When removing sk98lin was first
proposed, I tried skge, first measurements showed it was 5-8% slower,
NOT what I want, so I went back. For me there was no reliability issue,
but I never tried it in a system with more than one NIC on the driver.
Would "it's a little slower" be a valid bug report? Or would I have
gotten "works fine for me" from people not beating it over Gbit? I
didn't try sky2 until you suggested it, and I have reported my results
previously, just stops working. Could it be my hardware? I tried it on
one system, so yes, but sk98lin works for months.

> That's a question especially for the people who now had problems after
> sk98lin was removed.

So if you want people to try a new driver, I think it really has to have
some benefits to the users, in terms of performance, reliability, or
features. "Cleaner design" doesn't motivate, and it does raise the
question of why the old driver wasn't just cleaned up. I've been doing
software for decades, I appreciate why, but users in general just want
to use their system. Which raises the question of why to delete drivers
which work for many or even most users? Testing a new kernel is no
longer a drop-in-and-boot operation if modprobe.conf must be edited to get
the network up, and the typical user isn't going to write that shell
script to try one or the other driver.

Honestly, new drivers which offer little benefit to most users are the
exception rather than the rule, so this may be a corner case. I would like
to see sk98lin back in the kernel; for a while I can build my own
kernels and patch it in, but until other drivers are drop-in, I probably
won't change.

Separate but related: why keep skge and sky2? Are we going through this
again in a year? Is the benefit worth the effort?

Hope some of this is helpful.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-09-11 15:04:00

by Adrian Bunk

Subject: Re: sk98lin for 2.6.23-rc1

On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
> Adrian Bunk wrote:
>> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>>
>>> There are several different problems in this thread:
>>> 1. The removal of old sk98lin driver caused some users to be forced to
>>> use
>>> skge. These users have uncovered issues with the dual port fiber
>>> based versions
>>> of the board. Short term: The sk98lin driver should be restored
>>> to previous state, and the PCI table should be used to limit the
>>> usage to only fiber systems.
>>> If Adrian doesn't do it, I'll do it when I return from Germany.
>>> ...
>>>
>>
>> No problem with this, but since it was Jeff's patch it should better be
>> him who reverts it (and he's anyway one step nearer to Linus).
>>
>> But the underlying general problem still remains:
>>
>> How can we get people to test and report bugs with the new drivers before
>> removing the old driver?
>>
>>
> Sorry for a long answer, I'm trying to provide insight on two recent cases.
>
> Thinking back to several drivers, when e100 was new I tried it because I
> had problems with eepro100 in the area of multiple cards, multiple cables
> on a single card, and jumbo packets. For a while I used both, until e100
> worked where I need it. So I initially tried it because it had features I
> needed, and then dropped to older driver just to avoid having to decide.
>
> With sk98lin, the driver worked flawlessly with all (3-4) systems, so I had
> no reason to try any other. When removing sk98lin was first proposed, I
> tried skge, first measurements showed it was 5-8% slower, NOT what I want,
> so I went back. For me there was no reliability issue, but I never tried it
> in a system with more than one NIC on the driver. Would "it's a little
> slower" be a valid bug report? Or would I have gotten "works fine for me"
> from people not beating it over Gbit?
>...

If you get less throughput that is a regression, and it should be
reported and fixed.

I doubt anybody would have told you otherwise.

Is this bug still present as of 2.6.23-rc6?

>> That's a question especially for the people who now had problems after
>> sk98lin was removed.
>
> So if you want people to try a new driver, I think it really has to have
> some benefits to the users, in terms of performance, reliability, or
> features. "Cleaner design" doesn't motivate, and it does raise the question
> of why the old driver wasn't just cleaned up. I've been doing software for
> decades, I appreciate why, but users in general just want to use their
> system. Which raises the question of why to delete drivers which work for
> many or even most users?

As I already explained, there is a long term advantage for all users if
there is only one driver in the kernel. Therefore all users should
switch away from obsolete drivers to the replacement drivers, and the
obsolete driver will be removed at some point in time. The only question
is how to do it.

> Testing a new kernel is no longer a drop in a boot
> operation if modprobe.conf must be edited to get the network up, and the
> typical user isn't going to write that shell script to try one or the other
> driver.

The typical user will let his distribution handle this.

And MODULE_ALIAS can also handle this.

> Honestly, new drivers which offer little benefit to most users are the
> exception rather than the rule, so this may be a corner case. I would like
> to see sk98lin back in the kernel for a while. I can build my own kernels
> and patch it in, but until other drivers are drop-in, I probably won't
> change.

That a new driver offers benefits that cause most users to switch isn't
realistic.

You mention e100 as an example - well, I'm using this driver in my
computer, but I doubt anything would be worse for me if I'd use the
obsolete eepro100 driver instead since I'm not using any of the fancy
e100 features you mentioned as advantages.

There is a long term advantage for all users if there is only one driver
in the kernel. Therefore all users should switch away from obsolete
drivers to the replacement drivers, and the obsolete driver will be
removed at some point in time. The only question is how to do it.

> Separate but related: why keep skge and sky2? Are we going through this
> again in a year? Is the benefit worth the effort?
>...

skge and sky2 support distinct hardware.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-09-11 22:20:42

by James Corey

Subject: Re: sk98lin for 2.6.23-rc1


--- Stephen Hemminger
<[email protected]> wrote:

> On Sun, 9 Sep 2007 13:13:26 +0200
> Adrian Bunk <[email protected]> wrote:
>
> > On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
> > >
> > > > You are a regular reader of linux-kernel, and therefore the sk98lin
> > > > removal can hardly be a surprise for you. If you prefer whining over
> > > > helping to improve the kernel that's your choice...
> > > >
> > > In my case the issue is simply one of practicality: I cannot go to the
> > > data center 5 times per day to reboot my colo box. Therefore, I run
> > > sk98lin. It's really that simple.
> >
> > When did you report this bug the first time?
> >
> > What we need is that people when testing a new kernel they plan to use
> > test the new drivers *and report the bugs if they run into any*.
> >
> > What could we have done so that you reported your bug without removing
> > the sk98lin driver?
> >
> > > Kyle
> >
> > cu
> > Adrian
>
>
> There are several different problems in this thread:
> 1. The removal of old sk98lin driver caused some users to be forced to use
>    skge. These users have uncovered issues with the dual port fiber based
>    versions of the board.
>    Short term: The sk98lin driver should be restored to previous state,
>    and the PCI table should be used to limit the usage to only fiber
>    systems. If Adrian doesn't do it, I'll do it when I return from Germany.
>    Long term: I have fiber based board (thanks ebay) on the way to resolve
>    skge bug.
>
> 2. Sky2 driver has its own fiber based problems.
>    Solve these after skge fiber.
>
> 3. Sky2 doesn't have as many workarounds for hardware problems as vendor
>    sk98lin driver.
> -


Hm, hope I didn't trigger a religious debate. When
you get to the point of working on the SKY2 driver
problem with DGE-550SX (Syskonnect SK-9S81) also
known as the "hw csum failure" issue, I'll be
glad to test a patch or take debug data. Til then,
I'll stay out of the way.

-J







2007-09-11 22:37:52

by Willy Tarreau

Subject: Re: sk98lin for 2.6.23-rc1

On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
> > So if you want people to try a new driver, I think it really has to have
> > some benefits to the users, in terms of performance, reliability, or
> > features. "Cleaner design" doesn't motivate, and it does raise the question
> > of why the old driver wasn't just cleaned up. I've been doing software for
> > decades, I appreciate why, but users in general just want to use their
> > system. Which raises the question of why to delete drivers which work for
> > many or even most users?
>
> As I already explained, there is a long term advantage for all users if
> there is only one driver in the kernel.

Not only that. You have to place the switch in its context with history.
Stephen, please correct me if I'm wrong, but sk98lin has been randomly
working for a very long time. Not 100% the driver's fault, because it
has had to work around a lot of chip bugs. The fact that this driver
supports *all* chips in the family makes it harder to identify whether
problems are caused by the hardware or by the driver because it is
bloated with tons of if/else.

I've personally encountered random data corruption on the receive path
with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes
it would require one terabyte of data, sometimes just a few hundred
megs. On other hardware (skge now), UDP would simply stop being sent
and some TCP traffic was necessary to restart UDP! One guy at Marvell
once asked me for more information, but it was not easy to provide
much more, given the randomness of the problems!

Stephen has done an excellent (and thankless) job at restarting from
scratch, and the idea to separate the two chips was a good one IMHO.
The problem is that he might have thought that most of the bugs were
in the driver, while most of them are in the hardware, and this requires
a lot of workarounds, which do not always work the same for everybody
(I remember having tried to disable flow control with sk98lin because
it helped with sky2).

In parallel, sk98lin has improved on the vendor's site. v8 exhibited
all the problems I explained above, but v10 has fixed a lot of them,
making the new sk98lin more reliable. In parallel, sky2 and skge have
gotten wider acceptance and testing. The nastiest hardware bugs will
slowly surface, a good deal of driver bugs have been detected too
(and that's expected from any new driver).

It is possible that after 2 or 3 patches, a lot of the remaining
problems will suddenly vanish. But it's also possible that the driver
will still not work for 1% of people for 1 or 2 years because of some
obscure hardware combinations which trigger some obscure hardware bugs.

> Therefore all users should
> switch away from obsolete drivers to the replacement drivers, and the
> obsolete driver will be removed at some point in time. The only question
> is how to do it.

Desktop users generally have no problem experimenting with multiple kernels
or drivers. They can report feedback too, but generally, they're not very
good at downloading alternative drivers and patching their kernel with those.

Server users cannot experiment for a long time. After 2 or 3 losses of
service, they *have* to provide a definitive solution. For some of them
when sky2 fails, it may very well be to switch over to sk98lin. Downloading
from the vendor's site and patching is not a problem for those users, but
it causes them the trouble of updating the kernel for security fixes, so
the old driver must be shipped with the kernel.

However, I remember something which might constitute a solution. In 2.4,
there's a small bug in the kbuild process on alpha. One question is always
asked during make oldconfig. Its saved value is ignored because of the way
it is computed. I don't know if we could do this with 2.6 kbuild. It would
then be nice to always set sk98lin to unset if it was set to "Y" or "M",
so that at each build, the user has to explicitly state he wants it. It's
annoying enough to give the other one a try once in a while, without causing
too much trouble to people who really have no other choice right now.

What we need with this driver is people being fed up with it, not them
being unable to use it as a last resort. Also, given that it has improved
over the last years (probably due to competition pressure from sky2/skge),
users will even less understand why there is such incentive to remove it.

Another trick for obsolete drivers would be to simply remove them from
the usual build system, but keep them available for explicit build.
E.g.: make modules will not build them, but make obsolete-modules would.

> > Testing a new kernel is no longer a drop in a boot
> > operation if modprobe.conf must be edited to get the network up, and the
> > typical user isn't going to write that shell script to try one or the other
> > driver.
>
> The typical user will let his distribution handle this.
>
> And MODULE_ALIAS can also handle this.

No system config should be edited to switch back to the alternative,
otherwise it remains in its working state.

> > Honestly, new drivers which offer little benefit to most users are the
> > exception rather than the rule, so this may be a corner case. I would like
> > to see sk98lin back in the kernel for a while. I can build my own kernels
> > and patch it in, but until other drivers are drop-in, I probably won't
> > change.
>
> That a new driver offers benefits that cause most users to switch isn't
> realistic.

Desktop users are curious and have plenty of time to kill. Server users
are frightened and lazy. So I think that annoying the user slightly is
a good solution (eg: make obsolete-modules).

> You mention e100 as an example - well, I'm using this driver in my
> computer, but I doubt anything would be worse for me if I'd use the
> obsolete eepro100 driver instead since I'm not using any of the fancy
> e100 features you mentioned as advantages.

After having been happy with eepro100 for years, I discovered many problems
with its VLAN support in 2.4 (MTU, ...) for which e100 was a solution. It
was a good reason to switch. But the old e100 driver took ages to load (half
of the machine boot time), which was not satisfying. So having a new driver
load faster is another good reason to switch.

> There is a long term advantage for all users if there is only one driver
> in the kernel. Therefore all users should switch away from obsolete
> drivers to the replacement drivers, and the obsolete driver will be
> removed at some point in time. The only question is how to do it.

Hmmm we already read this paragraph above :-)

> > Separate but related: why keep skge and sky2? Are we going through this
> > again in a year? Is the benefit worth the effort?
> >...
>
> skge and sky2 support distinct hardware.

... and as such are both smaller than sk98lin which supports both.

Cheers,
Willy

2007-09-12 16:47:00

by Torsten Kaiser

Subject: Re: sk98lin for 2.6.23-rc1

On 9/5/07, Stephen Hemminger <[email protected]> wrote:
>
> The only known outstanding problems on 2.6.22.6 of sky2 are:
> * problems with fibre PHY based systems
> * suspend/resume issues, missing multicast reinitialization, etc.
> The previous stability problems have been addressed.

Sorry to disappoint you, but it just hung for me again.
After seeing the backport of commit c59697e06058fc2361da8cefcfa3de85ac107582 as
"sky2: restore workarounds for lost interrupts" going into 2.6.22.5 I
decided to give it another try.

First tests worked and for two days I had no trouble, but today the
network hung again, until I removed and reinserted the sky2 module.

I'm using the Gentoo kernel 2.6.22-gentoo-r6 which is based on
2.6.22.6. (All patches at
http://dev.gentoo.org/~dsd/genpatches/patches-2.6.22-7.htm )
This is an x86_64 kernel but with a 32bit userland.

My hardware:
00:00.0 Host bridge: Intel Corporation 82915G/P/GV/GL/PL/910GL Memory
Controller Hub (rev 04)
00:02.0 VGA compatible controller: Intel Corporation 82915G/GV/910GL
Integrated Graphics Controller (rev 04)
00:02.1 Display controller: Intel Corporation 82915G Integrated
Graphics Controller (rev 04)
00:1b.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) High Definition Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) PCI Express Port 1 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d3)
00:1f.0 ISA bridge: Intel Corporation 82801FB/FR (ICH6/ICH6R) LPC
Interface Bridge (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6
Family) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FB/FW (ICH6/ICH6W) SATA
Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family)
SMBus Controller (rev 03)
01:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A
IEEE-1394a-2000 Controller (PHY/Link)
01:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053
PCI-E Gigabit Ethernet Controller (rev 19)

The Marvell controller is onboard, more info:
linux ~ # lspci -vxxx -s 02:00.0
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053
PCI-E Gigabit Ethernet Controller (rev 19)
Subsystem: ASUSTeK Computer Inc. Marvell 88E8053 Gigabit
Ethernet controller PCIe (Asus)
Flags: bus master, fast devsel, latency 0, IRQ 318
Memory at cfffc000 (64-bit, non-prefetchable) [size=16K]
I/O ports at e800 [size=256]
Expansion ROM at cffc0000 [disabled] [size=128K]
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+
Queue=0/1 Enable+
Capabilities: [e0] Express Legacy Endpoint IRQ 0
00: ab 11 62 43 07 04 10 00 19 00 00 02 04 00 00 00
10: 04 c0 ff cf 00 00 00 00 01 e8 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 42 81
30: 00 00 fc cf 48 00 00 00 00 00 00 00 0a 01 00 00
40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 13
50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
60: 0c 30 e0 fe 00 00 00 00 89 41 00 00 00 00 00 00
70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 10 00 11 00 c0 0f 00 00 00 20 1b 00 11 a4 03 00
f0: 08 00 11 10 00 00 00 00 00 00 00 00 00 00 00 00

From /proc/interrupts:
318: 230462 0 PCI-MSI-edge eth2

From syslog:
Sep 12 11:01:27 linux [ 9580.538373] CIFS VFS: server not responding
Sep 12 11:01:27 linux [ 9580.538385] CIFS VFS: No response for cmd 50 mid 34863

Now the network was dead, I tried to restart it with ifconfig down &&
ifconfig up

Sep 12 11:03:54 linux [ 9727.917997] sky2 eth2: disabling interface
Sep 12 11:03:55 linux [ 9728.270436] sky2 eth2: enabling interface
Sep 12 11:03:55 linux [ 9728.272401] sky2 eth2: ram buffer 48K
Sep 12 11:03:56 linux [ 9730.016797] sky2 eth2: Link is up at 100
Mbps, full duplex, flow control both

As that did not help, I removed the sky2 module and reinserted it:

Sep 12 11:04:12 linux [ 9745.832197] sky2 eth2: disabling interface
Sep 12 11:04:18 linux [ 9751.197733] ACPI: PCI interrupt for device
0000:02:00.0 disabled
Sep 12 11:04:25 linux [ 9758.264714] ACPI: PCI Interrupt
0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Sep 12 11:04:25 linux [ 9758.264736] PCI: Setting latency timer of
device 0000:02:00.0 to 64
Sep 12 11:04:25 linux [ 9758.265409] sky2 0000:02:00.0: v1.14 addr
0xcfffc000 irq 16 Yukon-EC (0xb6) rev 2
Sep 12 11:04:25 linux [ 9758.265910] sky2 eth0: addr 00:15:f2:55:ce:f9
Sep 12 11:04:25 linux [ 9758.267754] udev: renamed network interface
eth0 to eth2
Sep 12 11:04:25 linux [ 9758.705240] sky2 eth2: enabling interface
Sep 12 11:04:25 linux [ 9758.707076] sky2 eth2: ram buffer 48K
Sep 12 11:04:27 linux [ 9760.592061] sky2 eth2: Link is up at 100
Mbps, full duplex, flow control both

Now the network was up again, but around one hour later it hung again.
Again after removing and reinserting the module it started to work
again, this time until I went home.
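
The manual recovery described above (reloading the module once traffic
stops) could be automated roughly as in the following sketch. This is a
workaround, not a fix, and everything in it is an assumption made for the
example: the peer address is a documentation placeholder, the defaults for
MODULE and IFACE are taken from the log above, and the helper names run,
peer_alive and reload_driver are invented. With DRY_RUN=1 (the default)
nothing is changed and the commands are only printed.

```shell
#!/bin/sh
# Sketch: if a known peer stops answering pings, remove and reinsert the
# NIC driver module. All defaults below are illustrative assumptions.
MODULE="${MODULE:-sky2}"
IFACE="${IFACE:-eth2}"
PEER="${PEER:-192.0.2.1}"   # placeholder address, replace with a real host
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print commands, 0 = really run them

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# One ping with a 2 second timeout; failure means the peer is unreachable.
peer_alive() {
    ping -c 1 -W 2 "$PEER" >/dev/null 2>&1
}

# Take the interface down, cycle the module, bring the interface back up.
reload_driver() {
    run ip link set "$IFACE" down
    run modprobe -r "$MODULE"
    run modprobe "$MODULE"
    run ip link set "$IFACE" up
}

if ! peer_alive; then
    echo "no reply from $PEER, reloading $MODULE"
    reload_driver
fi
```

Run periodically as root with DRY_RUN=0 (for example from cron), this would
only shorten an outage; it does not help in finding the underlying bug.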

I switched back to the Realtek 8139, as that card works.

I can provide more info about the hardware, but I can't test any
patches, as this server is needed for work and random hangs after
hours of working are not really the nicest things to debug.

Torsten

2007-11-06 22:25:38

by Stephen Hemminger

Subject: Re: sk98lin for 2.6.23-rc1

On Sun, 9 Sep 2007 05:54:45 -0700 (PDT)
Chris Stromsoe <[email protected]> wrote:

> On Sat, 8 Sep 2007, Adrian Bunk wrote:
> > On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
> >
> >> Haven't tried later kernels, don't intend to, while no network is
> >> really secure, it not really useful.
> >
> > You are a regular reader of linux-kernel, and therefore the sk98lin
> > removal can hardly be a surprise for you. If you prefer whining over
> > helping to improve the kernel that's your choice...
>
> I've been trying to migrate off sk98lin to skge since earlier this year,
> without success, starting with 2.6.18 or .19.
>
> I have several of these cards in production using the sk98lin driver:
>
> fresno:~# lspci -vv -s 02:01
> 02:01.0 Ethernet controller: SysKonnect SK-9872 Gigabit Ethernet Server Adapter (SK-NET GE-ZX dual link) (rev 11)
> Subsystem: SysKonnect SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
> Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 32 bytes
> Interrupt: pin A routed to IRQ 22
> Region 0: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
> Region 1: I/O ports at e800 [size=256]
> Expansion ROM at febc0000 [disabled] [size=128K]
> Capabilities: [48] Power Management version 1
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] Vital Product Data
>
> They are dual port SX fiber. Both ports are connected. If I do this:
>
> fresno:~# modprobe skge
> fresno:~# ip li set eth2 up
> fresno:~# ip li set eth2 down
> fresno:~# ip li set eth3 up
>
> the system locks up and I have to power cycle it. The order doesn't
> matter (if I do eth3 up/down, then eth2 up kills it).
>
> I don't have any problems with sk98lin. This works fine:
>
> fresno:~# modprobe sk98lin RlmtMode=DualNet
> fresno:~# ip li set eth2 up
> fresno:~# ip li set eth2 down
> fresno:~# ip li set eth3 up
> fresno:~# ip li set eth3 down
>
>
> I am more than happy to test various driver changes, and have tried a few
> suggested patches but nothing has worked so far. I would like to be using
> skge instead of sk98lin, but so far haven't had any success.

Please test 2.6.24-rc1 (or -rc2) because there were several fixes for skge
that made it work correctly for dual port fiber board. The worst bug in skge
was that it configured the ram buffer incorrectly.

I just submitted these for next 2.6.23.X stable release as well

--
Stephen Hemminger <[email protected]>

2007-11-07 02:06:17

by Chris Stromsoe

Subject: Re: sk98lin for 2.6.23-rc1

On Tue, 6 Nov 2007, Stephen Hemminger wrote:
> On Sun, 9 Sep 2007 05:54:45 -0700 (PDT)
> Chris Stromsoe <[email protected]> wrote:
>
>> On Sat, 8 Sep 2007, Adrian Bunk wrote:
>>> On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
>>>
>>>> Haven't tried later kernels, don't intend to, while no network is
>>>> really secure, it not really useful.
>>>
>>> You are a regular reader of linux-kernel, and therefore the sk98lin
>>> removal can hardly be a surprise for you. If you prefer whining over
>>> helping to improve the kernel that's your choice...
>>
>> I've been trying to migrate off sk98lin to skge since earlier this year,
>> without success, starting with 2.6.18 or .19.
>>
>> I have several of these cards in production using the sk98lin driver:
>>
>> fresno:~# lspci -vv -s 02:01
>> 02:01.0 Ethernet controller: SysKonnect SK-9872 Gigabit Ethernet Server Adapter (SK-NET GE-ZX dual link) (rev 11)
>> Subsystem: SysKonnect SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
>> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>> Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 32 bytes
>> Interrupt: pin A routed to IRQ 22
>> Region 0: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
>> Region 1: I/O ports at e800 [size=256]
>> Expansion ROM at febc0000 [disabled] [size=128K]
>> Capabilities: [48] Power Management version 1
>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>> Status: D0 PME-Enable- DSel=0 DScale=1 PME-
>> Capabilities: [50] Vital Product Data
>>
>> They are dual port SX fiber. Both ports are connected. If I do this:
>>
>> fresno:~# modprobe skge
>> fresno:~# ip li set eth2 up
>> fresno:~# ip li set eth2 down
>> fresno:~# ip li set eth3 up
>>
>> the system locks up and I have to power cycle it. The order doesn't
>> matter (if I do eth3 up/down, then eth2 up kills it).
>>
>> I don't have any problems with sk98lin. This works fine:
>>
>> fresno:~# modprobe sk98lin RlmtMode=DualNet
>> fresno:~# ip li set eth2 up
>> fresno:~# ip li set eth2 down
>> fresno:~# ip li set eth3 up
>> fresno:~# ip li set eth3 down
>>
>>
>> I am more than happy to test various driver changes, and have tried a few
>> suggested patches but nothing has worked so far. I would like to be using
>> skge instead of sk98lin, but so far haven't had any success.
>
> Please test 2.6.24-rc1 (or -rc2) because there were several fixes for skge
> that made it work correctly for dual port fiber board. The worst bug in skge
> was that it configured the ram buffer incorrectly.
>
> I just submitted these for next 2.6.23.X stable release as well


I tested 2.6.24-rc1. This series of commands

fresno:~# modprobe skge
fresno:~# ip li set eth2 up
fresno:~# ip li set eth2 down
fresno:~# ip li set eth3 up

still hard-locks the box in the same place. Was there anything in the
-rc2 patch for skge?



-Chris