2008-02-25 20:35:35

by Marin Mitov

Subject: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

Hi all,

I see very rare freezes under heavy outbound traffic
(sending a ~4GB DVD image to other host(s) on the same LAN)
with the skge driver (NIC on the mobo) and also (recently tested)
with rtl8139 and dmfe NICs on the PCI bus. There is a single
switch between the hosts (tested with another one just to rule out
a faulty switch).

skge <--> Marvell 88E8001 chip
8139too <--> Realtek 8136B chip
dmfe <--> Davicom DM9102 chip

The symptoms are similar: tx timeouts and no more network activity.
The KDE desktop works, computational programs work, and the machine
is usable, but it cannot ping, nor can it be pinged anymore.
rmmod && modprobe of the respective module repairs the problem.
Simple surfing/e-mailing from it does not trigger the problem.

The machine is used as an LTSP server for old PCs (as X terminals)
(mostly outbound traffic) and is not usable as such due to this
problem.

The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).

Since this happens with 3 different NICs/drivers, could it be
a problem in the networking subsystem (common to all of them)?

Since many people work on this machine, only limited
testing can be done.

Thank you in advance for your suggestions, help (and patches).

Regards.

Marin Mitov


2008-02-25 20:53:18

by Jeff Garzik

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

Marin Mitov wrote:
> Hi all,
>
> I experience very rare freezes at heavy outbound traffic
> (sending ~4GB DVD image to another host(s) on the same LAN)
> using skge driver (NIC on the mobo) as well as (recently tested)
> using rtl8139 or dmfe NICs on the PCI bus. There is a single
> switch between them (tested with another one just to exclude
> a faulty switch).
>
> skge <--> Marvell 88E8001 chip
> 8139too <--> Realtek 8136B chip
> dmfe <--> Davicom DM9102 chip
>
> Symptoms are similar: tx timeouts and no more net activity.
> KDE desktop works, computational programs - work, the machine
> is usable, but cannot ping, nor can be ping-ed anymore.
> rmmod && modprobe the respective modules repairs the problem.
> Simple surfing/e-mailing from it do not trigger the problem.
>
> The machine is used as LTSP server for old PCs (as X terminals)
> (mostly outbound traffic) and is not usable as such due to this
> problem.
>
> The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
>
> As far as this happens with 3 different NICs/drivers could it be
> a problem in the (common for all of them) networking subsystem?

A TX timeout (like hardware timeouts, in general) is a very generic
behavior, with many causes.

In general, when you see timeouts with varied hardware and drivers,
you're almost always dealing with a problem with interrupt delivery, or
a generic system problem, rather than bugs in the network stack or all
three drivers.

Jeff


2008-02-25 21:34:30

by Marin Mitov

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

On Monday 25 February 2008 10:53:01 pm you wrote:
> Marin Mitov wrote:
> > Hi all,
> >
> > I experience very rare freezes at heavy outbound traffic
> > (sending ~4GB DVD image to another host(s) on the same LAN)
> > using skge driver (NIC on the mobo) as well as (recently tested)
> > using rtl8139 or dmfe NICs on the PCI bus. There is a single
> > switch between them (tested with another one just to exclude
> > a faulty switch).
> >
> > skge <--> Marvell 88E8001 chip
> > 8139too <--> Realtek 8136B chip
> > dmfe <--> Davicom DM9102 chip
> >
> > Symptoms are similar: tx timeouts and no more net activity.
> > KDE desktop works, computational programs - work, the machine
> > is usable, but cannot ping, nor can be ping-ed anymore.
> > rmmod && modprobe the respective modules repairs the problem.
> > Simple surfing/e-mailing from it do not trigger the problem.
> >
> > The machine is used as LTSP server for old PCs (as X terminals)
> > (mostly outbound traffic) and is not usable as such due to this
> > problem.
> >
> > The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
> >
> > As far as this happens with 3 different NICs/drivers could it be
> > a problem in the (common for all of them) networking subsystem?
>
> A TX timeout (like hardware timeouts, in general) is a very generic
> behavior, with many causes.
>
> In general, when you see timeouts with varied hardware and drivers,
> you're almost always dealing with a problem with interrupt delivery, or

All the drivers are using INTA# on the PCI bus (no MSI/MSI-X).

"problem with interrupt delivery" - do you suspect interrupts incorrectly
disabled (lost) in the drivers, or faulty hardware (motherboard)?

> a generic system problem, rather than bugs in the network stack or all

"a generic system problem" - bad config or faulty hardware (motherboard)?

Where should I look for the problem?

Just for info: the system is very stable - uptime (barring power outages) can
be a month or more (rebooting only for kernel changes or updates).

Marin Mitov

> three drivers.
>
> Jeff

2008-02-25 21:42:34

by Stephen Hemminger

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

On Mon, 25 Feb 2008 23:36:06 +0200
Marin Mitov <[email protected]> wrote:

> On Monday 25 February 2008 10:53:01 pm you wrote:
> > Marin Mitov wrote:
> > > Hi all,
> > >
> > > I experience very rare freezes at heavy outbound traffic
> > > (sending ~4GB DVD image to another host(s) on the same LAN)
> > > using skge driver (NIC on the mobo) as well as (recently tested)
> > > using rtl8139 or dmfe NICs on the PCI bus. There is a single
> > > switch between them (tested with another one just to exclude
> > > a faulty switch).
> > >
> > > skge <--> Marvell 88E8001 chip
> > > 8139too <--> Realtek 8136B chip
> > > dmfe <--> Davicom DM9102 chip
> > >
> > > Symptoms are similar: tx timeouts and no more net activity.
> > > KDE desktop works, computational programs - work, the machine
> > > is usable, but cannot ping, nor can be ping-ed anymore.
> > > rmmod && modprobe the respective modules repairs the problem.
> > > Simple surfing/e-mailing from it do not trigger the problem.
> > >
> > > The machine is used as LTSP server for old PCs (as X terminals)
> > > (mostly outbound traffic) and is not usable as such due to this
> > > problem.
> > >
> > > The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
> > >
> > > As far as this happens with 3 different NICs/drivers could it be
> > > a problem in the (common for all of them) networking subsystem?
> >
> > A TX timeout (like hardware timeouts, in general) is a very generic
> > behavior, with many causes.
> >
> > In general, when you see timeouts with varied hardware and drivers,
> > you're almost always dealing with a problem with interrupt delivery, or
>
> All the drivers are using #INTA on PCI bus (no MSI/MSI-X).
>
> "problem with interrupt delivery" - you suspect interrupts incorrectly
> disabled (lost) in the drivers or faulty hardware(motherboard)?
>
> > a generic system problem, rather than bugs in the network stack or all
>
> "a generic system problem" - bad config or faulty hardware(motherboard)?
>
> Where I should look for the problem?
>
> Just for info: the system is very stable - uptime (if no power outages) could
> be a month or more (rebooting for kernel changes or updates).
>
> Marin Mitov

Make sure the interrupt is showing up as level triggered in /proc/interrupts.
The BIOS may be configuring it as edge-triggered and that won't work with
Ethernet drivers that use NAPI.
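
A minimal sketch of that check, assuming the 2.6-era /proc/interrupts
layout quoted in this thread (per-CPU counters, then a chip name such as
IO-APIC-edge, then the device names). The helper names parse_interrupts
and trigger_mode are hypothetical, and treating the "-fasteoi" suffix as
level-triggered is an assumption about how kernels of this era label
level-triggered IO-APIC lines. Newer kernels print this file differently,
so the parser would need adjusting there.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())            # header row: "CPU0 CPU1 ..."
    table = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        if not sep:
            continue                        # skip rows without "label:"
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters.
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (counts, " ".join(fields))
    return table


def trigger_mode(description):
    """Guess the trigger mode from the chip name (assumed naming scheme)."""
    chip = description.split()[0] if description else ""
    if chip.endswith("-edge"):
        return "edge"
    if chip.endswith("-fasteoi") or chip.endswith("-level"):
        return "level"
    return "unknown"


# Sample rows in the layout assumed above.
SAMPLE = """\
           CPU0       CPU1
  0:         85          1   IO-APIC-edge      timer
 21:   11691000   11933174   IO-APIC-fasteoi   eth0
ERR:          0
"""

for label, (counts, desc) in parse_interrupts(SAMPLE).items():
    print(label, counts, trigger_mode(desc), desc)
```

On a live system the text would come from open("/proc/interrupts").read()
instead of SAMPLE.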

2008-02-25 22:08:19

by Marin Mitov

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

Hi Stephen,

> Make sure the interrupt is showing up as level triggered in
> /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
> won't work with Ethernet drivers that use NAPI.

For skge <--> Marvell 88E8001 chip,
cat /proc/interrupts gives (AMD64 X2 SMP):
           CPU0       CPU1
 21:   11691000   11933174   IO-APIC-fasteoi   eth0

It is neither IO-APIC-edge nor IO-APIC-level.

Could that be the problem?

Marin Mitov

2008-02-25 22:59:07

by Stephen Hemminger

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

On Tue, 26 Feb 2008 00:09:46 +0200
Marin Mitov <[email protected]> wrote:

> Hi Stephen,
>
> > Make sure the interrupt is showing up as level triggered in
> > /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
> > won't work with Ethernet drivers that use NAPI.
>
> for: skge <--> Marvell 88E8001 chip
> cat /proc/interrupts gives (AMD64 X2 SMP):
> CPU0 CPU1
> 21: 11691000 11933174 IO-APIC-fasteoi eth0
>
> It is neither IO-APIC-edge, nor IO-APIC-level.
>
> Could it be the problem?
>
> Marin Mitov

No, that isn't the problem.

2008-03-12 11:39:34

by Marin Mitov

Subject: Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

On Monday 25 February 2008 10:53:01 pm you wrote:
> > As far as this happens with 3 different NICs/drivers could it be
> > a problem in the (common for all of them) networking subsystem?
>
> A TX timeout (like hardware timeouts, in general) is a very generic
> behavior, with many causes.
>
> In general, when you see timeouts with varied hardware and drivers,
> you're almost always dealing with a problem with interrupt delivery, or
> a generic system problem, rather than bugs in the network stack or all
> three drivers.

Well, this gave me a direction to investigate.

By adding printk calls in various parts of the skge driver and modifying it to
collect additional statistics (read via ethtool -S eth0), I made the following
observations when it freezes:

1. Interrupts are generated (the status register shows there are pending
interrupts and they are NOT masked), but irq_handler is NOT invoked.

2. cat /proc/interrupts shows that while skge is working, both CPUs
receive all IRQs. When skge freezes, NEITHER CPU receives skge's
interrupts; CPU[0] still receives all other IRQs, while CPU[1] receives
no IRQs above the line (see below) but still receives LOC: and RES:
below the line.
# cat /proc/interrupts
           CPU0       CPU1
  0:         85          1   IO-APIC-edge      timer
  1:      34078          9   IO-APIC-edge      i8042
  6:          1          4   IO-APIC-edge      floppy
  7:        216          1   IO-APIC-edge      parport0
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:     893003    1390080   IO-APIC-edge      i8042
 14:      59682     286628   IO-APIC-edge      ide0
 15:    5458527         12   IO-APIC-edge      ide1
 16:   60547054          1   IO-APIC-fasteoi   mga@pci:0000:01:00.0
 17:    1634623     914447   IO-APIC-fasteoi   sata_via
 18:       7768          7   IO-APIC-fasteoi   sata_promise
 19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
 20:     535380          1   IO-APIC-fasteoi   VIA8237
 21:   30780380   31448992   IO-APIC-fasteoi   eth0
---------line added by me----------------------------------
NMI:          0          0   Non-maskable interrupts
LOC:  154311126  154736178   Local timer interrupts
RES:    1325239    2423719   Rescheduling interrupts
CAL:      40893        456   function call interrupts
TLB:      52651      29184   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0

It looks as if IRQs are somehow disabled (at the IO-APIC/LAPIC?)
at some priority and below.
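
The comparison behind observation 2 can be sketched as a diff of two
/proc/interrupts snapshots taken a few seconds apart: any watched IRQ
whose counters did not advance on any CPU is reported as stalled. This is
only an illustration; the helper names (read_counts, per_cpu_delta,
stalled) are hypothetical, the counters in BEFORE and AFTER are made up,
and the parsing assumes the 2.6-era layout shown above.

```python
def read_counts(text):
    """Return {label: [per-CPU counts]} from /proc/interrupts-style text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())            # header row: "CPU0 CPU1 ..."
    counts = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        if not sep:
            continue                        # e.g. a hand-added separator line
        nums = []
        for tok in rest.split():
            if len(nums) == ncpu or not tok.isdigit():
                break
            nums.append(int(tok))
        counts[label.strip()] = nums
    return counts


def per_cpu_delta(before, after):
    """Per-CPU counter increase for every label present in both snapshots."""
    a, b = read_counts(before), read_counts(after)
    return {k: [y - x for x, y in zip(a[k], b[k])] for k in a if k in b}


def stalled(before, after, watch):
    """Watched labels whose counters did not move on any CPU."""
    delta = per_cpu_delta(before, after)
    return [k for k in watch if k in delta and not any(delta[k])]


# Made-up snapshots: IRQ 21 (eth0) is frozen, IRQ 12 advances on CPU0 only.
BEFORE = """\
           CPU0       CPU1
 12:       1000       2000   IO-APIC-edge      i8042
 21:       5000       6000   IO-APIC-fasteoi   eth0
LOC:      90000      91000   Local timer interrupts
"""
AFTER = """\
           CPU0       CPU1
 12:       1005       2000   IO-APIC-edge      i8042
 21:       5000       6000   IO-APIC-fasteoi   eth0
LOC:      90250      91250   Local timer interrupts
"""

print(stalled(BEFORE, AFTER, ["12", "21"]))
print(per_cpu_delta(BEFORE, AFTER))
```

A zero delta on one CPU alone is not conclusive, since IRQ affinity or
balancing can legitimately steer a line to a single CPU; the interesting
case is a busy device whose line stops advancing everywhere.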

I should mention here that after a freeze, ifconfig down/up (plus restoring the
routing info) does NOT solve the problem, while rmmod/modprobe of the driver
makes it work again.

So I moved the request_irq()/free_irq() calls from the driver's probe()/remove()
methods to its open()/stop() methods. With this change, when skge freezes,
ifconfig down/up makes it work again (no need to rmmod/modprobe).

This makes me think that skge's IRQ is somehow being disabled OUTSIDE the driver,
and that free_irq()/request_irq() clears the condition. Am I wrong?

Is that possible? How could it happen?

Any comments/suggestions/patches are welcome.

Regards

Marin Mitov