2009-10-08 14:08:46

by Nick Bowler

[permalink] [raw]
Subject: [Regression, bisected] Transmit failure in et131x.

I have a PCI-express et131x card which works well with the driver in
2.6.31 but fails in 2.6.32-rc3. The card works at first (for example, I
can ping another host on the network), but 'scp' will only transfer a
few kilobytes before stalling. Once this happens, ping no longer works.
After a short while, kernel WARNINGs start being spammed to the system
log. It is possible to regain a working state by re-loading the
module.

I bisected the problem to the following commit. The WARN messages and
bisection log are at the end of this email.

356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff is first bad commit
commit 356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff
Author: Alan Cox <[email protected]>
Date: Thu Aug 27 11:01:57 2009 +0100

Staging: et131x: clean up DMA10/DMA4 types

Signed-off-by: Alan Cox <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

:040000 040000 5b3af63d8ab03c021a31600806a8211606351e67 9a023d8edb540ef53dba297@

Oct 5 23:37:53 aphrodite ------------[ cut here ]------------
Oct 5 23:37:53 aphrodite WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x13c/0x1fd()
Oct 5 23:37:53 aphrodite Hardware name: System Product Name
Oct 5 23:37:53 aphrodite NETDEV WATCHDOG: eth0 (et131x): transmit queue 0 timed out
Oct 5 23:37:53 aphrodite Modules linked in: nfs lockd sunrpc ipv6 hwmon_vid k8temp hwmon i2c_nforce2 snd_seq_midi snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_ens1371 snd_rawmidi snd_seq_device snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ohci_hcd ehci_hcd usbcore nls_base sg sr_mod cdrom pata_amd et131x(C)
Oct 5 23:37:53 aphrodite Pid: 0, comm: swapper Tainted: G C 2.6.32-rc3-00053-g152b0bb #18
Oct 5 23:37:53 aphrodite Call Trace:
Oct 5 23:37:53 aphrodite <IRQ> [<ffffffff8126aad0>] ? dev_watchdog+0x13c/0x1fd
Oct 5 23:37:53 aphrodite [<ffffffff8102ec05>] warn_slowpath_common+0x77/0xa4
Oct 5 23:37:53 aphrodite [<ffffffff8102eca7>] warn_slowpath_fmt+0x64/0x66
Oct 5 23:37:53 aphrodite [<ffffffff8102c26b>] ? default_wake_function+0xd/0xf
Oct 5 23:37:53 aphrodite [<ffffffff8104150b>] ? autoremove_wake_function+0x11/0x38
Oct 5 23:37:53 aphrodite [<ffffffff8125a9d2>] ? netdev_drivername+0x43/0x4a
Oct 5 23:37:53 aphrodite [<ffffffff8126aad0>] dev_watchdog+0x13c/0x1fd
Oct 5 23:37:53 aphrodite [<ffffffff8102b076>] ? __wake_up+0x42/0x46
Oct 5 23:37:53 aphrodite [<ffffffff8103dcad>] ? __queue_work+0x80/0x84
Oct 5 23:37:53 aphrodite [<ffffffff8126a994>] ? dev_watchdog+0x0/0x1fd
Oct 5 23:37:53 aphrodite [<ffffffff810379e7>] run_timer_softirq+0x189/0x20e
Oct 5 23:37:53 aphrodite [<ffffffff81033643>] __do_softirq+0x80/0x101
Oct 5 23:37:53 aphrodite [<ffffffff8100c78a>] call_softirq+0x1a/0x24
Oct 5 23:37:53 aphrodite [<ffffffff8100dd91>] do_softirq+0x31/0x69
Oct 5 23:37:53 aphrodite [<ffffffff81033371>] irq_exit+0x34/0x7a
Oct 5 23:37:53 aphrodite [<ffffffff8101b1dc>] smp_apic_timer_interrupt+0x7a/0x86
Oct 5 23:37:53 aphrodite [<ffffffff8100c2e3>] apic_timer_interrupt+0x13/0x20
Oct 5 23:37:53 aphrodite <EOI> [<ffffffff8101173b>] ? default_idle+0x1e/0x32
Oct 5 23:37:53 aphrodite [<ffffffff81011a57>] ? c1e_idle+0xd9/0xe0
Oct 5 23:37:53 aphrodite [<ffffffff810450ee>] ? atomic_notifier_call_chain+0xf/0x11
Oct 5 23:37:53 aphrodite [<ffffffff8100ad10>] ? cpu_idle+0x3b/0x7a
Oct 5 23:37:53 aphrodite [<ffffffff812ab466>] ? rest_init+0x7a/0x7e
Oct 5 23:37:53 aphrodite [<ffffffff81426b08>] ? start_kernel+0x30a/0x317
Oct 5 23:37:53 aphrodite [<ffffffff81426295>] ? x86_64_start_reservations+0xa5/0xa9
Oct 5 23:37:53 aphrodite [<ffffffff8142637a>] ? x86_64_start_kernel+0xe1/0xe8
Oct 5 23:37:53 aphrodite ---[ end trace 08f8caa393b9f89b ]---
Oct 5 23:37:53 aphrodite et131x 0000:02:00.0: Send stuck - reset. pMpTcb->WrIndex 0, Flags 0x00000000

Oct 5 23:37:53 aphrodite ------------[ cut here ]------------
Oct 5 23:37:53 aphrodite WARNING: at kernel/irq/manage.c:842 __free_irq+0x52/0x172()
Oct 5 23:37:53 aphrodite Hardware name: System Product Name
Oct 5 23:37:53 aphrodite Trying to free IRQ 16 from IRQ context!
Oct 5 23:37:53 aphrodite Modules linked in: nfs lockd sunrpc ipv6 hwmon_vid k8temp hwmon i2c_nforce2 snd_seq_midi snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_ens1371 snd_rawmidi snd_seq_device snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ohci_hcd ehci_hcd usbcore nls_base sg sr_mod cdrom pata_amd et131x(C)
Oct 5 23:37:53 aphrodite Pid: 0, comm: swapper Tainted: G WC 2.6.32-rc3-00053-g152b0bb #18
Oct 5 23:37:53 aphrodite Call Trace:
Oct 5 23:37:53 aphrodite <IRQ> [<ffffffff8105821a>] ? __free_irq+0x52/0x172
Oct 5 23:37:53 aphrodite [<ffffffff8102ec05>] warn_slowpath_common+0x77/0xa4
Oct 5 23:37:53 aphrodite [<ffffffff8102eca7>] warn_slowpath_fmt+0x64/0x66
Oct 5 23:37:53 aphrodite [<ffffffff812b1c62>] ? printk+0x67/0x6d
Oct 5 23:37:53 aphrodite [<ffffffff8105821a>] __free_irq+0x52/0x172
Oct 5 23:37:53 aphrodite [<ffffffff81058377>] free_irq+0x3d/0x63
Oct 5 23:37:53 aphrodite [<ffffffffa00064b2>] et131x_close+0x50/0x5f [et131x]
Oct 5 23:37:53 aphrodite [<ffffffffa0006693>] et131x_tx_timeout+0x125/0x160 [et131x]
Oct 5 23:37:53 aphrodite [<ffffffff8102eca7>] ? warn_slowpath_fmt+0x64/0x66
Oct 5 23:37:53 aphrodite [<ffffffff8102c26b>] ? default_wake_function+0xd/0xf
Oct 5 23:37:53 aphrodite [<ffffffff8104150b>] ? autoremove_wake_function+0x11/0x38
Oct 5 23:37:53 aphrodite [<ffffffff8126aae8>] dev_watchdog+0x154/0x1fd
Oct 5 23:37:53 aphrodite [<ffffffff8102b076>] ? __wake_up+0x42/0x46
Oct 5 23:37:53 aphrodite [<ffffffff8103dcad>] ? __queue_work+0x80/0x84
Oct 5 23:37:53 aphrodite [<ffffffff8126a994>] ? dev_watchdog+0x0/0x1fd
Oct 5 23:37:53 aphrodite [<ffffffff810379e7>] run_timer_softirq+0x189/0x20e
Oct 5 23:37:53 aphrodite [<ffffffff81033643>] __do_softirq+0x80/0x101
Oct 5 23:37:53 aphrodite [<ffffffff8100c78a>] call_softirq+0x1a/0x24
Oct 5 23:37:53 aphrodite [<ffffffff8100dd91>] do_softirq+0x31/0x69
Oct 5 23:37:53 aphrodite [<ffffffff81033371>] irq_exit+0x34/0x7a
Oct 5 23:37:53 aphrodite [<ffffffff8101b1dc>] smp_apic_timer_interrupt+0x7a/0x86
Oct 5 23:37:53 aphrodite [<ffffffff8100c2e3>] apic_timer_interrupt+0x13/0x20
Oct 5 23:37:53 aphrodite <EOI> [<ffffffff8101173b>] ? default_idle+0x1e/0x32
Oct 5 23:37:53 aphrodite [<ffffffff81011a57>] ? c1e_idle+0xd9/0xe0
Oct 5 23:37:53 aphrodite [<ffffffff810450ee>] ? atomic_notifier_call_chain+0xf/0x11
Oct 5 23:37:53 aphrodite [<ffffffff8100ad10>] ? cpu_idle+0x3b/0x7a
Oct 5 23:37:53 aphrodite [<ffffffff812ab466>] ? rest_init+0x7a/0x7e
Oct 5 23:37:53 aphrodite [<ffffffff81426b08>] ? start_kernel+0x30a/0x317
Oct 5 23:37:53 aphrodite [<ffffffff81426295>] ? x86_64_start_reservations+0xa5/0xa9
Oct 5 23:37:53 aphrodite [<ffffffff8142637a>] ? x86_64_start_kernel+0xe1/0xe8
Oct 5 23:37:53 aphrodite ---[ end trace 08f8caa393b9f89c ]---
Oct 5 23:37:54 aphrodite et131x 0000:02:00.0: Send stuck - reset. pMpTcb->WrIndex 0, Flags 0x00000000

The second WARNING is continuously spammed at a rate of about one per second.

git bisect start
# bad: [0eca52a92735f43462165efe00a7e394345fb38e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-2.6
git bisect bad 0eca52a92735f43462165efe00a7e394345fb38e
# good: [74fca6a42863ffacaf7ba6f1936a9f228950f657] Linux 2.6.31
git bisect good 74fca6a42863ffacaf7ba6f1936a9f228950f657
# good: [910d7b31f598a1ce7a0325889537aa993ed1ce1c] Staging: hv: remove include/HvHalApi.h
git bisect good 910d7b31f598a1ce7a0325889537aa993ed1ce1c
# bad: [43368e74d126f8e874fce2c248f094edfcf0baa6] MAINTAINERS: acpi: add 'include/acpi'
git bisect bad 43368e74d126f8e874fce2c248f094edfcf0baa6
# bad: [fa877c71e2136bd682b45022c96d5e073ced9f58] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
git bisect bad fa877c71e2136bd682b45022c96d5e073ced9f58
# bad: [5bf9cbef9972f851172391a37261b12bba63f733] Blackfin: update ftrace for latest toolchain
git bisect bad 5bf9cbef9972f851172391a37261b12bba63f733
# good: [76efa5e34ed81587e26e483a624dc7f72efb389a] Staging: pohmelfs: fix atomic type spew
git bisect good 76efa5e34ed81587e26e483a624dc7f72efb389a
# good: [0950efd1a1490e869d19ec631eed75ef57772f8b] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pcmcia-2.6
git bisect good 0950efd1a1490e869d19ec631eed75ef57772f8b
# bad: [7ea61767e41e2baedd6a968d13f56026522e1207] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
git bisect bad 7ea61767e41e2baedd6a968d13f56026522e1207
# bad: [ecdfa44610fa18678c3dd481af75368b9800c6c7] Staging: add Realtek 8192 PCI wireless driver
git bisect bad ecdfa44610fa18678c3dd481af75368b9800c6c7
# good: [13071fded6df0979d50307a463be819377fc2822] Staging: et131x: eeprom remove features
git bisect good 13071fded6df0979d50307a463be819377fc2822
# bad: [e266b2022209a2bc389c1cd6b809395c67671a92] Staging: et131x: kill MSI type
git bisect bad e266b2022209a2bc389c1cd6b809395c67671a92
# good: [f2c98d27b8e88cd17bb3e77f8fccf70f8d2ebd2f] Staging: et131x: clean up PM_CSR_t
git bisect good f2c98d27b8e88cd17bb3e77f8fccf70f8d2ebd2f
# bad: [f6b35d66cf3284fd76cd3b7dd170630235ce304c] Staging: et131x: clean up MP_FLAG macros
git bisect bad f6b35d66cf3284fd76cd3b7dd170630235ce304c
# bad: [356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff] Staging: et131x: clean up DMA10/DMA4 types
git bisect bad 356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)


2009-10-08 21:31:23

by Alan

[permalink] [raw]
Subject: Re: [Regression, bisected] Transmit failure in et131x.

> I bisected the problem to the following commit. The WARN messages and
> bisection log are at the end of this email.
>
> 356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff is first bad commit
> commit 356c74b401f6b05ae5d793e9d1a9ba8297b8e3ff
> Author: Alan Cox <[email protected]>
> Date: Thu Aug 27 11:01:57 2009 +0100

Thanks. I'll go through that patch again see what I broke on your card.
Probably a wrapping bug with a long queue that my box for some reason
didn't manage to hit.

> Oct 5 23:37:53 aphrodite WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x13c/0x1fd()
> Oct 5 23:37:53 aphrodite Hardware name: System Product Name
> Oct 5 23:37:53 aphrodite NETDEV WATCHDOG: eth0 (et131x): transmit queue 0 timed out

That is the important bit. The rest is fall out from the fact the driver
doesn't recover properly when it fails.

Alan

2009-10-09 10:19:39

by Alan

[permalink] [raw]
Subject: Re: [Regression, bisected] Transmit failure in et131x.

Please test the following if you would.

If it doesn't work see if you can tell from the other end and tcpdump if the receiver or sender side dies, and let me know if you are using 10, 100 or Gbit and if you have the mtu set over 1500.


et131x: Correct WRAP bit handling

From: Alan Cox <[email protected]>

add_10bit loses the existing wrap value

Signed-off-by: Alan Cox <[email protected]>
---

drivers/staging/et131x/et1310_rx.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)


diff --git a/drivers/staging/et131x/et1310_rx.c b/drivers/staging/et131x/et1310_rx.c
index 5356d0f..4c4555d 100644
--- a/drivers/staging/et131x/et1310_rx.c
+++ b/drivers/staging/et131x/et1310_rx.c
@@ -1074,12 +1074,20 @@ void et131x_handle_recv_interrupt(struct et131x_adapter *etdev)

static inline u32 bump_fbr(u32 *fbr, u32 limit)
{
- u32 v = *fbr;
- add_10bit(&v, 1);
- if (v > limit)
- v = (*fbr & ~ET_DMA10_MASK) ^ ET_DMA10_WRAP;
- *fbr = v;
- return v;
+ u32 v = *fbr;
+ v++;
+ /* This works for all cases where limit < 1024. The 1023 case
+ works because 1023++ is 1024 which means the if condition is not
+ taken but the carry of the bit into the wrap bit toggles the wrap
+ value correctly */
+ if ((v & ET_DMA10_MASK) > limit) {
+ v &= ~ET_DMA10_MASK;
+ v ^= ET_DMA10_WRAP;
+ }
+ /* For the 1023 case */
+ v &= (ET_DMA10_MASK|ET_DMA10_WRAP);
+ *fbr = v;
+ return v;
}

/**

2009-10-11 23:09:21

by Nick Bowler

[permalink] [raw]
Subject: Re: [Regression, bisected] Transmit failure in et131x.

Alan Cox <alan <at> lxorguk.ukuu.org.uk> writes:
> Please test the following if you would.
>
> If it doesn't work see if you can tell from the other end and tcpdump
> if the receiver or sender side dies, and let me know if you are using
> 10, 100 or Gbit and if you have the mtu set over 1500.

Thanks a lot for your help. Alas, this does not resolve the issue.
Using tcpdump indicates that receive continues to work while transmit
fails.

I was using gigabit, but the problem persists when I run at 100mbit.
The MTU setting remains at the default 1500 bytes.

In case it's useful, here's the lspci -nv output for my card:

02:00.0 0200: 11c1:ed00 (rev 02)
Subsystem: 11c1:ed00
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at fd400000 (64-bit, non-prefetchable) [size=2M]
Expansion ROM at fd7e0000 [disabled] [size=128K]
Capabilities: [40] Power Management version 2
Capabilities: [48] Express Endpoint, MSI 00
Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+ Count=1/1
Enable-
Capabilities: [100] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSVoil-
UEMsk: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+
MalfTLP+ ECRC+ UnsupReq+ ACSVoil-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
MalfTLP+ ECRC- UnsupReq- ACSVoil-
CESta: RxErr- BadTLP- BadDLLP+ Rollover- Timeout+ NonFatalErr-
CESta: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [130] Virtual Channel <?>
Capabilities: [14c] Device Serial Number d9-29-03-fe-ff-3b-13-02
Kernel driver in use: et131x
Kernel modules: et131x

2009-10-12 10:35:41

by Alan

[permalink] [raw]
Subject: Re: [Regression, bisected] Transmit failure in et131x.

> Thanks a lot for your help. Alas, this does not resolve the issue.
> Using tcpdump indicates that receive continues to work while transmit
> fails.
>
> I was using gigabit, but the problem persists when I run at 100mbit.
> The MTU setting remains at the default 1500 bytes.

et131x: Fix the add_10bit macro

From: Alan Cox <[email protected]>

Duh.. we need to preserve the wrap bit when adding.

Signed-off-by: Alan Cox <[email protected]>
---

drivers/staging/et131x/et1310_address_map.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/staging/et131x/et1310_address_map.h b/drivers/staging/et131x/et1310_address_map.h
index 2273926..d71e04a 100644
--- a/drivers/staging/et131x/et1310_address_map.h
+++ b/drivers/staging/et131x/et1310_address_map.h
@@ -223,7 +223,7 @@ typedef union _TXDMA_PR_NUM_DES_t {

extern inline void add_10bit(u32 *v, int n)
{
- *v = INDEX10(*v + n);
+ *v = INDEX10(*v + n) | (*v & ET_DMA10_WRAP);
}

/*

2009-10-12 14:20:53

by Nick Bowler

[permalink] [raw]
Subject: Re: [Regression, bisected] Transmit failure in et131x.

Alan Cox <alan <at> lxorguk.ukuu.org.uk> writes:
> > Thanks a lot for your help. Alas, this does not resolve the issue.
> > Using tcpdump indicates that receive continues to work while transmit
> > fails.
> >
> > I was using gigabit, but the problem persists when I run at 100mbit.
> > The MTU setting remains at the default 1500 bytes.
>
> et131x: Fix the add_10bit macro

That seems to have done the trick (in conjunction with the earlier patch,
at least) -- thanks!

Reported-and-tested-by: Nick Bowler <[email protected]>