Hi,
on this Ideapad S10 the onboard Broadcom BCM5906M prints the warning
below, once. From then on, the "transmit timed out, resetting" message
repeats, every now and then.
This laptop is mounting 2 readonly NFS shares from a box in the same LAN
and when scanning lots of files on these NFS shares, the transmit timeouts
occur more often, I think. When there's sequential traffic (i.e. reading
larger files from the NFS shares), fewer warnings occur. But this is just
manual observation, I haven't been able to reproduce this reliably.
However, there's constant traffic on the device (maybe ~700KB/s both tx
and rx), so the messages occur pretty regularly.
I have reported the error against the Fedora 17 kernel [0], but it happens
with a vanilla 3.4.0 too [1]; see [1] for the full dmesg, .config and more.
I had a similar issue a while ago [2] and had almost forgotten about it. The
laptop has run Ubuntu 10.04 (2.6.32) since then and the problem was gone, so
I'd say 2.6.32 fixed it. Now the same laptop has switched to Fedora (kernel
3.3.4) and the problem seems to be back.
I'll try running with sg=off, as Matt suggested in [3] and report back.
Thanks,
Christian.
[0] https://bugzilla.redhat.com/show_bug.cgi?id=825123
[1] http://nerdbynature.de/bits/3.4.0/tg3/
[2] http://lkml.indiana.edu/hypermail/linux/kernel/0906.1/00004.html
[3] http://lkml.indiana.edu/hypermail/linux/kernel/0906.1/00317.html
------------[ cut here ]------------
WARNING: at /opt/home/chrisk/dev/linux-2.6-git/net/sched/sch_generic.c:255 dev_watchdog+0x1cc/0x1e0()
Hardware name: Lenovo
NETDEV WATCHDOG: p2p1 (tg3): transmit queue 0 timed out
Modules linked in: acpi_cpufreq mperf freq_table nfs lockd sunrpc b43 mac80211 cfg80211 ssb coretemp hwmon usb_storage [last unloaded: scsi_wait_scan]
Pid: 685, comm: FahCore_78 Not tainted 3.4.0-10151-g4fc3acf #8
Call Trace:
[<c102b299>] ? warn_slowpath_common+0x79/0xb0
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c102b374>] ? warn_slowpath_fmt+0x34/0x40
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c12d5320>] ? pfifo_fast_dequeue+0xe0/0xe0
[<c1035cf1>] ? run_timer_softirq+0xd1/0x1d0
[<c1031615>] ? __do_softirq+0x75/0x100
[<c10315a0>] ? remote_softirq_receive+0x20/0x20
<IRQ> [<c10318a6>] ? irq_exit+0x66/0x90
[<c101b8d9>] ? smp_apic_timer_interrupt+0x59/0x90
[<c1360b35>] ? apic_timer_interrupt+0x31/0x38
[<c1360000>] ? rt_mutex_trylock+0x70/0x70
---[ end trace 9de668a859ee5d6c ]---
tg3 0000:02:00.0: p2p1: transmit timed out, resetting
--
BOFH excuse #438:
sticky bit has come loose
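For readers unfamiliar with the message: the sketch below is a simplified user-space model, not the kernel's code, of why the stack trace above appears only once while the driver's "transmit timed out, resetting" line keeps repeating. It assumes that the watchdog fires when a stopped queue has not transmitted for longer than a timeout, and that the warning itself is print-once. The names txq_model, txq_timed_out and report_timeout are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for one transmit queue. */
struct txq_model {
    bool stopped;              /* driver stopped the queue (e.g. ring full) */
    unsigned long trans_start; /* tick of the last transmit */
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* True when the queue has been stopped for more than `timeout` ticks. */
static bool txq_timed_out(const struct txq_model *q, unsigned long now,
                          unsigned long timeout)
{
    assert(q != NULL);
    return q->stopped && tick_after(now, q->trans_start + timeout);
}

/*
 * Print the generic warning only the first time, and the driver-style
 * reset message every time. Returns the number of lines printed.
 */
static int report_timeout(const char *ifname, bool *warned_once)
{
    int lines = 0;

    assert(ifname != NULL && warned_once != NULL);
    if (!*warned_once) {
        printf("NETDEV WATCHDOG: %s: transmit queue 0 timed out\n", ifname);
        *warned_once = true;
        lines++;
    }
    printf("%s: transmit timed out, resetting\n", ifname);
    return lines + 1;
}
```

Calling report_timeout() twice with the same flag prints two lines the first time and one line afterwards, which mirrors the pattern described in the report.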
On Mon, 4 Jun 2012 at 16:14, Christian Kujau wrote:
> I'll try running with sg=off, as Matt suggested in [3] and report back.
sg=off seems to help: no errors since I disabled scatter-gather yesterday.
Any thoughts on this issue?
Christian.
--
BOFH excuse #18:
excess surge protection
I'm attempting to reproduce this in our lab. In the meantime,
the latest revisions of the driver output a register dump and some
additional information when transmit timeouts happen. It would be
useful to see that data. Would it be possible to try one of the latest
kernels and get this information?
I found many similar bug reports with a quick Google search.
The root cause of this issue may be related to the Broadcom tg3 firmware
and the tg3 hardware revision, so I think it will be hard to fix in the
Linux driver. A better option is to get another NIC, or to disable some of
its features as a workaround once we know which feature triggers it (tso? sg?).
Some debugging messages from other reports:
[ 3538.223529] tg3 0000:01:08.0: eth1: transmit timed out, resetting
[ 3538.229698] tg3 0000:01:08.0: eth1: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
[ 3538.236001] tg3 0000:01:08.0: eth1: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
[ 3538.343602] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=1800 enable_bit=2
[ 3538.449609] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
[ 3538.555402] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
[ 3538.692079] tg3 0000:01:08.0: eth1: Link is down
We can see that tg3_reset_hw() --> tg3_stop_fw() --> tg3_stop_block() times out,
so the firmware is not responding correctly.
Just my 2 cents.
Ethan
Hi Ethan. This device does not have any special firmware (beyond
bootcode). It shouldn't be necessary to disable any of the device's
features if it is working correctly.
Thanks for the debugging output. The tg3_stop_block() timeouts mean
that (a portion of) the chip is stuck somehow. Later drivers output a lot
more information than this. The additional information can help answer a
lot of questions in a short period of time. I was hoping I could
accomplish a lot more in fewer emails if I have more data available. :)
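As a rough illustration of what a tg3_stop_block() timeout means, here is a small user-space model. It assumes that the driver clears a block's enable bit and then polls the register until the hardware reports the block as stopped; a block that never reports this is "stuck". The names block_model, model_read_mode and model_stop_block, and the poll budget, are hypothetical. This is not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_POLLS 1000 /* hypothetical poll budget */

/* Hypothetical stand-in for one hardware block's mode register. */
struct block_model {
    uint32_t mode;        /* current register value */
    int polls_until_idle; /* polls before the bit clears; negative = stuck */
};

/* Simulated register read: the enable bit drops once the block is idle. */
static uint32_t model_read_mode(struct block_model *b, uint32_t enable_bit)
{
    if (b->polls_until_idle > 0)
        b->polls_until_idle--;
    else if (b->polls_until_idle == 0)
        b->mode &= ~enable_bit;
    return b->mode;
}

/* Returns 0 when the block stopped in time, -1 on timeout. */
static int model_stop_block(struct block_model *b, uint32_t enable_bit)
{
    assert(b != NULL && enable_bit != 0);
    for (int i = 0; i < MODEL_MAX_POLLS; i++) {
        if ((model_read_mode(b, enable_bit) & enable_bit) == 0)
            return 0;
    }
    return -1; /* corresponds to a "timed out" log line */
}
```

In this model, a block with a negative polls_until_idle never clears its enable bit, so model_stop_block() gives up after the poll budget, much like the three "tg3_stop_block timed out" lines in the log quoted earlier.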
So there is no way to fix it via a firmware update or the Linux driver? :<
On Wed, 2012-06-06 at 10:29 +0800, ethan zhao wrote:
> So no way to fix it via firmware update or Linux driver ? :<
Yes, but you need to cooperate, or else it might take more time than
necessary.
Asking questions like that on lkml is not going to help very much.
So, once again, we kindly ask you to try a recent kernel and post the
register dump and the additional information printed when transmit
timeouts happen.
The 'latest kernel' is either linux-3.5-rc1 or one of David Miller's
trees:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=summary
or
http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=summary
Thanks
Eric,
That was a request for confirmation from Matt Carlson of Broadcom.
Ethan
On Tue, 5 Jun 2012 at 18:02, Matt Carlson wrote:
> I'm attempting to reproduce this in our lab. In the meantime,
> the latest revisions of the driver output a register dump and some
> additional information when transmit timeouts happen. It would be
> useful to see that data.
I only copied part of the warning into my initial email; much more
followed after that, which looks like a register dump. I've put
everything (the whole logs and more) here:
http://nerdbynature.de/bits/3.4.0/tg3/
Is that what you're looking for?
> Would it be possible to try one of the latest kernels and get this information?
I've observed this with 3.4, but I'll update to latest 3.5-git tomorrow
and let you know.
Thanks for replying,
Christian.
--
BOFH excuse #192:
runaway cat on system.
On Tue, 5 Jun 2012 at 23:17, Christian Kujau wrote:
> I've only copied so much of the warning into my initial email, but after
> that, much more followed, which looks like a register dump. I've put
> everything (the whole logs and more) here:
>
> http://nerdbynature.de/bits/3.4.0/tg3/
>
> Is that what you're looking for?
Have you had a chance to look at those outputs yet?
> > Would it be possible to try one of the latest kernels and get this information?
I'm running today's git (3.5.0-rc1-00110-g71fae7e) and ~3.5h after
booting the same warning was printed, along with the register dump (if
that's what it is). I've put the full output online again:
http://nerdbynature.de/bits/3.4.0/tg3/
- messages_3.5.0-rc1-00110-g71fae7e.txt.gz
- config_3.5.0-rc1-00110-g71fae7e.gz
Thanks,
Christian.
--
BOFH excuse #10:
hardware stress fractures
On Thu, 7 Jun 2012 at 17:47, Ethan Zhao wrote:
> Could you try 3.5RC1+ with pcie_aspm=off kernel parameter ?
Will try.
> I notice there are some AER errors (UnsupReq+, RxErr+) with the tg3
> in your lspci output. Have you seen the AER errors on the console? If
> so, please attach them.
I haven't seen any actual errors on the console, except these messages
during bootup:
--------
pci0000:00: Requesting ACPI _OSC control (0x1d)
pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d
ACPI _OSC control for PCIe not granted, disabling ASPM
--------
I have also seen these messages in 3.2.0 (an Ubuntu kernel). I can't say
that I saw them before that; they did not show up when I booted this
2.6.38 Ubuntu kernel.
I'll try booting with pcie_aspm=off and see what it gives...
Thanks,
Christian.
PS: AFAICT, Ubuntu's 2.6.38 [0] had these options set:
--------
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
--------
With 3.5.x, my .config has:
--------
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
--------
[0] https://launchpad.net/ubuntu/natty/i386/linux-image-2.6.38-15-generic/2.6.38-15.60
--
BOFH excuse #338:
old inkjet cartridges emanate barium-based fumes
On Thu, 7 Jun 2012 at 05:52, Christian Kujau wrote:
> On Thu, 7 Jun 2012 at 17:47, Ethan Zhao wrote:
> > Could you try 3.5RC1+ with pcie_aspm=off kernel parameter ?
Hm, this didn't help.
> > I notice there are some AER errors (UnsupReq+, RxErr+) with the tg3
> > in your lspci output
Isn't lspci just listing ASPM _capabilities_ there? Booting with
pcie_aspm=off showed almost the same output:
http://nerdbynature.de/bits/3.4.0/tg3/lspci_aspm.diff.txt
So, the workaround for me is to disable "scatter-gather":
ethtool -K p2p1 sg off
With that, no more errors show up and the interface keeps working.
Christian.
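One plausible reading of why this helps, sketched under assumptions: with scatter-gather off, the network stack hands the driver each packet as one contiguous buffer instead of a list of fragments, so the NIC is never given very small separate fragments to DMA. The model below only illustrates that copying step. frag_model, linearize and dma_segments are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one packet fragment. */
struct frag_model {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy all fragments back to back into `out`.
 * Returns the total length, or 0 if `out` is too small.
 */
static size_t linearize(const struct frag_model *frags, size_t nfrags,
                        unsigned char *out, size_t out_cap)
{
    size_t total = 0;

    assert(frags != NULL && out != NULL);
    for (size_t i = 0; i < nfrags; i++) {
        if (frags[i].len > out_cap - total)
            return 0;
        memcpy(out + total, frags[i].data, frags[i].len);
        total += frags[i].len;
    }
    return total;
}

/* Number of separate buffers the NIC is asked to DMA for one packet. */
static size_t dma_segments(size_t nfrags, int sg_enabled)
{
    return sg_enabled ? nfrags : 1;
}
```

The trade-off in this model is the extra copy in linearize(): the CPU does more work per packet, which is likely acceptable at the ~700KB/s mentioned in the original report.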
--
BOFH excuse #396:
Mail server hit by UniSpammer.
On Wed, Jun 06, 2012 at 12:52:32PM +0800, ethan zhao wrote:
> Eric,
> That is ask for confirmation from Matt Carlson of Broadcom.
Does the following patch fix your problem?
[PATCH] tg3: Apply short DMA frag workaround to 5906
5906 devices also need the short DMA fragment workaround. This patch
makes the necessary change.
Signed-off-by: Matt Carlson <[email protected]>
---
drivers/net/ethernet/broadcom/tg3.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index d55df32..2db4d70 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -14275,7 +14275,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 		}
 	}
 
-	if (tg3_flag(tp, 5755_PLUS))
+	if (tg3_flag(tp, 5755_PLUS) ||
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
 		tg3_flag_set(tp, SHORT_DMA_BUG);
 
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5719)
--
1.7.3.4
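A simplified model of what the one-line change does, for readers who do not have tg3.c at hand: the 5906 is added to the set of chips for which the driver treats very short DMA fragments specially. The enum, the function names and the 8-byte threshold are assumptions for illustration only. The real driver keeps this as a per-device flag, and the exact threshold should be checked in tg3.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chip classes; the real driver uses flags and ASIC revisions. */
enum asic_model { MODEL_5755_PLUS, MODEL_5906, MODEL_OTHER };

#define MODEL_SHORT_FRAG_LEN 8 /* assumed threshold, illustration only */

/* After the patch, the 5906 is treated like the 5755-and-later family. */
static bool has_short_dma_bug(enum asic_model asic)
{
    return asic == MODEL_5755_PLUS || asic == MODEL_5906;
}

/* True when a fragment of this length must not go to the chip as-is. */
static bool frag_needs_workaround(enum asic_model asic, size_t frag_len)
{
    assert(frag_len > 0);
    return has_short_dma_bug(asic) && frag_len <= MODEL_SHORT_FRAG_LEN;
}
```

Before the patch, the equivalent of has_short_dma_bug(MODEL_5906) would have been false, so short fragments reached the 5906 unmodified.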
Matt,
I notice there are some AER errors (UnsupReq+, RxErr+) with the tg3
in Christian's lspci output. Do you know why, and how to clear them?
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
Thanks,
Ethan
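For reference, the flags in that lspci output correspond to bits in the AER status registers. The sketch below decodes the correctable error status word (CESta) into lspci's +/- notation. The bit positions follow the PCIe AER register layout as I understand it (they match the PCI_ERR_COR_* values in linux/pci_regs.h), while bit_name, cor_bits and format_status are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct bit_name {
    uint32_t mask;
    const char *name;
};

/* Correctable error status bits, named as lspci prints them. */
static const struct bit_name cor_bits[] = {
    { 0x00000001u, "RxErr" },
    { 0x00000040u, "BadTLP" },
    { 0x00000080u, "BadDLLP" },
    { 0x00000100u, "Rollover" },
    { 0x00001000u, "Timeout" },
    { 0x00002000u, "NonFatalErr" },
};

/*
 * Write "Name+ Name- ..." for each known bit into `out`.
 * Returns the number of bits that are set.
 */
static int format_status(uint32_t status, const struct bit_name *bits,
                         size_t nbits, char *out, size_t cap)
{
    size_t used = 0;
    int set = 0;

    assert(bits != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    for (size_t i = 0; i < nbits; i++) {
        int on = (status & bits[i].mask) != 0;
        int n = snprintf(out + used, cap - used, "%s%s%c",
                         i ? " " : "", bits[i].name, on ? '+' : '-');
        if (n < 0 || (size_t)n >= cap - used)
            break; /* stop if the output would be truncated */
        used += (size_t)n;
        set += on;
    }
    return set;
}
```

A status word of 0x00002001 corresponds to the CESta line above (RxErr+ and NonFatalErr+). In the PCIe specification these status bits are sticky and write-1-to-clear, so a set bit only records that the event happened at some point since the last clear, not that it is still happening.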
From: "Matt Carlson" <[email protected]>
Date: Thu, 7 Jun 2012 15:56:54 -0700
> Does the following patch fix your problem?
>
>
> [PATCH] tg3: Apply short DMA frag workaround to 5906
>
> 5906 devices also need the short DMA fragment workaround. This patch
> makes the necessary change.
>
> Signed-off-by: Matt Carlson <[email protected]>
Ping, what's the status of this?
From: Christian Kujau <[email protected]>
Date: Mon, 11 Jun 2012 16:53:16 -0700 (PDT)
> Tested-by: Christian Kujau <[email protected]>
Great, applied, thanks everyone.