2005-04-03 05:10:49

by jmerkey

Subject: Linux 2.6.9 Adaptec 4 Port Starfire Sickness

With Linux 2.6.9 running at 192 MB/s network loading, and protocol
splitting drivers routing packets out of a 2.6.9 device at the full
100 Mb/s (12.5 MB/s) simultaneously over 4 ports, the Adaptec Starfire
driver goes into constant Tx FIFO reconfiguration mode. It keeps
resetting the Tx FIFO window and generates a deluge of messages such as:

ethX: PCI bus congestion, resetting Tx FIFO window to X bytes

These pour into the system log file at a rate of a dozen per minute.
After 3-4 days of constant Tx FIFO reconfiguration at these bus loading
rates, the Starfire card locks up the PCI bus completely and hangs the
system. We need a config option that allows the starfire driver to
disable this feature.

Jeff


2005-04-03 05:47:53

by Willy Tarreau

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness

Hi Jeff,

I've also experienced those messages under 2.4, but they were harmless,
and I never had a machine hang even after weeks of full load (the adapter
was mounted on a stress test machine before being used in firewalls for
months).

So I wonder how you can be sure that it is this driver which finally locks
the bus. Perhaps the system locks up for some other reason (e.g. a race
condition). Have you tried any other 4-port NIC (tulip or sun, for
example)? A Sun QFE would be the most interesting to test, as it also
supports 64 bits / 66 MHz.

Regards,
Willy

On Sat, Apr 02, 2005 at 09:41:28PM -0700, jmerkey wrote:
> With linux 2.6.9 running at 192 MB/S network loading and protocol
> splitting drivers routing packets out of
> a 2.6.9 device at full 100 mb/s (12.5 MB/S) simultaneously over 4 ports,
> the adaptec starfire driver goes into
> constant Tx FIFO reconfiguration mode and after 3-4 days of constantly
> resetting the Tx FIFO window and
> generating a deluge of messages such as:
>
> ethX: PCI bus congestion, resetting Tx FIFO window to X bytes
>
> pouring into the system log file at a rate of a dozen per minute. After
> several days, the PCI bus totally locks up
> and hangs the system. Need a config option to allow the starfire to
> disable this feature. At very
> high bus loading rates, the starfire card will completely lock the bus
> after 3-4 days
> of constant Tx FIFO reconfiguration at very high data rates with
> protocol splitting and routing.
>
> Jeff
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2005-04-03 07:26:53

by Jeff Garzik

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness

jmerkey wrote:
> With linux 2.6.9 running at 192 MB/S network loading and protocol
> splitting drivers routing packets out of
> a 2.6.9 device at full 100 mb/s (12.5 MB/S) simultaneously over 4 ports,
> the adaptec starfire driver goes into
> constant Tx FIFO reconfiguration mode and after 3-4 days of constantly
> resetting the Tx FIFO window and
> generating a deluge of messages such as:
>
> ethX: PCI bus congestion, resetting Tx FIFO window to X bytes
>
> pouring into the system log file at a rate of a dozen per minute. After
> several days, the PCI bus totally locks up
> and hangs the system. Need a config option to allow the starfire to
> disable this feature. At very
> high bus loading rates, the starfire card will completely lock the bus
> after 3-4 days
> of constant Tx FIFO reconfiguration at very high data rates with
> protocol splitting and routing.

The feature doesn't need disabling; just modify the driver to stop the
flapping.

Jeff



2005-04-03 07:28:36

by jmerkey

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness


It works fine with the Intel Dual Port Pro-1000 MT adapters, without
these problems. My test scenarios use jumbo frames as well. I am
guessing that PCI bus contention is high because of the disk I/O
bandwidth, and that this creates conditions the adapter does not
normally see. The documentation states that this message should be
very rare; it should not spool into the logs at this rate.

See http://www.ibiblio.org/mdw/HOWTO/Ethernet-HOWTO-8.html

Jeff

2005-04-03 07:36:39

by jmerkey

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness

Jeff Garzik wrote:

> jmerkey wrote:
>
>> With linux 2.6.9 running at 192 MB/S network loading and protocol
>> splitting drivers routing packets out of
>> a 2.6.9 device at full 100 mb/s (12.5 MB/S) simultaneously over 4
>> ports, the adaptec starfire driver goes into
>> constant Tx FIFO reconfiguration mode and after 3-4 days of
>> constantly resetting the Tx FIFO window and
>> generating a deluge of messages such as:
>>
>> ethX: PCI bus congestion, resetting Tx FIFO window to X bytes
>>
>> pouring into the system log file at a rate of a dozen per minute.
>> After several days, the PCI bus totally locks up
>> and hangs the system. Need a config option to allow the starfire to
>> disable this feature. At very
>> high bus loading rates, the starfire card will completely lock the
>> bus after 3-4 days
>> of constant Tx FIFO reconfiguration at very high data rates with
>> protocol splitting and routing.
>
>
> The feature doesn't need disabling; just modify the driver to stop the
> flapping.
>
> Jeff
>
>
>
>
I am going to try turning off the Tx FIFO resetting in the code
completely, not just the message, and see if this helps.

Jeff

2005-04-03 07:39:17

by Willy Tarreau

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness

On Sat, Apr 02, 2005 at 11:58:44PM -0700, jmerkey wrote:
>
> It works fine with the Intel Dual Port Pro-1000 MT adapters without
> these problems.

But unless I'm mistaken, there's no PCI bridge on this board, and it is
possible that the two ports share the same IRQ. That's why I suggested
trying a 4-port Sun QFE or something else more similar to the starfire.

> I am using testing scenarios
> with Jumbo Frames as well. I am guessing the PCI bus contention is high
> due to the disk I/O bandwidth and
> this is causing conditions the adapter does not normally see.

As I said, I have been saturating this card for weeks during stress tests,
and although it spat out lots of messages, it never hung (at least on
recent 2.4 kernels; very early 2.4 kernels were a real pain with this one).

> Documentation states that this message should be very
> rare, and not spool off into the logs at this rate.

Perhaps you have a mix of small and large frames which makes the driver
constantly change the FIFO size, and this part is not handled properly?

Willy

2005-04-03 07:51:23

by jmerkey

Subject: Re: Linux 2.6.9 Adaptec 4 Port Starfire Sickness


I have disabled the FIFO resetting code and am running tests now. I am
on 2.6, not 2.4, so the problem could be specific to 2.6. At any rate,
I will see whether the problem goes away.

Jeff
