2010-11-22 18:41:59

by Larry Finger

[permalink] [raw]
Subject: b43 error under heavy load

Michael,

Under heavy load with multiple streams, I'm getting errors from my BCM4312
(14e4:4315) device as follows:

b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
(and continues).

Clearly that cookie is garbage as it has a slot number of 0. Any suggestions on
diagnostic information that I could dump to help debug this?

Thanks,

Larry


2010-11-22 19:12:37

by Larry Finger

[permalink] [raw]
Subject: Re: b43 error under heavy load

On 11/22/2010 12:53 PM, Michael Büsch wrote:
> On Mon, 2010-11-22 at 12:41 -0600, Larry Finger wrote:
>> Under heavy load with multiple streams, I'm getting errors from my BCM4312
>> (14e4:4315) device as follows:
>>
>> b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
>> (and continues).
>>
>> Clearly that cookie is garbage as it has a slot number of 0. Any suggestions on
>> diagnostic information that I could dump to help debug this?
>
> I think that probably is a firmware or hardware bug. The affected
> kernel code was not changed in years, as far as I know.

Thanks. I'll see if I can figure out a way for the driver to recover more
gracefully. In the meantime, I will be dumping the entire TX status struct to
see if that helps in seeing what is wrong.

Larry

2010-11-22 18:53:20

by Michael Büsch

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Mon, 2010-11-22 at 12:41 -0600, Larry Finger wrote:
> Under heavy load with multiple streams, I'm getting errors from my BCM4312
> (14e4:4315) device as follows:
>
> b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
> (and continues).
>
> Clearly that cookie is garbage as it has a slot number of 0. Any suggestions on
> diagnostic information that I could dump to help debug this?

I think that probably is a firmware or hardware bug. The affected
kernel code was not changed in years, as far as I know.

--
Greetings Michael.


2011-05-10 05:08:22

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2010/11/22 Larry Finger <[email protected]>:
> On 11/22/2010 12:53 PM, Michael Büsch wrote:
>> On Mon, 2010-11-22 at 12:41 -0600, Larry Finger wrote:
>>> Under heavy load with multiple streams, I'm getting errors from my BCM4312
>>> (14e4:4315) device as follows:
>>>
>>> b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
>>> (and continues).
>>>
>>> Clearly that cookie is garbage as it has a slot number of 0. Any suggestions on
>>> diagnostic information that I could dump to help debug this?
>>
>> I think that probably is a firmware or hardware bug. The affected
>> kernel code was not changed in years, as far as I know.
>
> Thanks. I'll see if I can figure out a way for the driver to recover more
> gracefully. In the meantime, I will be dumping the entire TX status struct to
> see if that helps in seeing what is wrong.

Larry, did you try debugging this? Juan noticed this issue on his
14e4:4315 as well.
[ 854.826369] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 54
[ 854.827348] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 56
[ 854.828080] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 58
[ 854.831431] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 60
[ 854.831453] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 62
[ 854.832440] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 64
[ 854.833836] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 66
[ 854.834567] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 68
[ 854.834588] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 70
[ 854.835378] b43-phy0 debug: Out of order TX status report on DMA
ring 1. Expected 74, but got 72

Connection is lost, reloading of the driver is needer.

--
Rafał

2011-05-10 18:34:39

by Larry Finger

[permalink] [raw]
Subject: Re: b43 error under heavy load

On 05/10/2011 12:08 AM, Rafał Miłecki wrote:
> 2010/11/22 Larry Finger<[email protected]>:
>> On 11/22/2010 12:53 PM, Michael Büsch wrote:
>>> On Mon, 2010-11-22 at 12:41 -0600, Larry Finger wrote:
>>>> Under heavy load with multiple streams, I'm getting errors from my BCM4312
>>>> (14e4:4315) device as follows:
>>>>
>>>> b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
>>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
>>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
>>>> (and continues).
>>>>
>>>> Clearly that cookie is garbage as it has a slot number of 0. Any suggestions on
>>>> diagnostic information that I could dump to help debug this?
>>>
>>> I think that probably is a firmware or hardware bug. The affected
>>> kernel code was not changed in years, as far as I know.
>>
>> Thanks. I'll see if I can figure out a way for the driver to recover more
>> gracefully. In the meantime, I will be dumping the entire TX status struct to
>> see if that helps in seeing what is wrong.
>
> Larry, did you try debugging this? Juan noticed this issue on his
> 14e4:4315 as well.
> [ 854.826369] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 54
> [ 854.827348] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 56
> [ 854.828080] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 58
> [ 854.831431] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 60
> [ 854.831453] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 62
> [ 854.832440] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 64
> [ 854.833836] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 66
> [ 854.834567] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 68
> [ 854.834588] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 70
> [ 854.835378] b43-phy0 debug: Out of order TX status report on DMA
> ring 1. Expected 74, but got 72
>
> Connection is lost, reloading of the driver is needer.

I have not gotten any further with this problem, but I just noticed a curious
"coincidence". The expected status number was 74 in both cases. I do not think
there is anything special about that particular slot. After all, there are 256
TX ring slots. As the comment says at line 1407 of dma.c, there might be a
firmware error. If so, it is common to several versions as I have seen it on a
BCM4318.

Larry

2011-05-10 21:19:14

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/5/10 Larry Finger <[email protected]>:
> On 05/10/2011 12:08 AM, Rafał Miłecki wrote:
>>
>> 2010/11/22 Larry Finger<[email protected]>:
>>>
>>> On 11/22/2010 12:53 PM, Michael Büsch wrote:
>>>>
>>>> On Mon, 2010-11-22 at 12:41 -0600, Larry Finger wrote:
>>>>>
>>>>> Under heavy load with multiple streams, I'm getting errors from my
>>>>> BCM4312
>>>>> (14e4:4315) device as follows:
>>>>>
>>>>> b43-phy0 debug: TX-status contains invalid cookie: 0x0D4D
>>>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected
>>>>> 74, but got 76
>>>>> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected
>>>>> 74, but got 78
>>>>> (and continues).
>>>>>
>>>>> Clearly that cookie is garbage as it has a slot number of 0. Any
>>>>> suggestions on
>>>>> diagnostic information that I could dump to help debug this?
>>>>
>>>> I think that probably is a firmware or hardware bug. The affected
>>>> kernel code was not changed in years, as far as I know.
>>>
>>> Thanks. I'll see if I can figure out a way for the driver to recover more
>>> gracefully. In the meantime, I will be dumping the entire TX status
>>> struct to
>>> see if that helps in seeing what is wrong.
>>
>> Larry, did you try debugging this? Juan noticed this issue on his
>> 14e4:4315 as well.
>> [  854.826369] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 54
>> [  854.827348] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 56
>> [  854.828080] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 58
>> [  854.831431] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 60
>> [  854.831453] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 62
>> [  854.832440] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 64
>> [  854.833836] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 66
>> [  854.834567] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 68
>> [  854.834588] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 70
>> [  854.835378] b43-phy0 debug: Out of order TX status report on DMA
>> ring 1. Expected 74, but got 72
>>
>> Connection is lost, reloading of the driver is needer.
>
> I have not gotten any further with this problem, but I just noticed a
> curious "coincidence". The expected status number was 74 in both cases. I do
> not think there is anything special about that particular slot. After all,
> there are 256 TX ring slots. As the comment says at line 1407 of dma.c,
> there might be a firmware error. If so, it is common to several versions as
> I have seen it on a BCM4318.

Yeah, it seems to be always 74 (the expected one). Here is the full log:
http://pastebin.com/bDpVXTtb

It's the first time I see the full log. It perfectly shows firmware
omitted 74, transmitted every next packet until reaching 72 and then
stopped.

Do you have any idea what can we do to debug this?

It seems to happen only under heavy load... maybe it has something to
do with reaching end of the ring (not whole ring, just end of free
slots)?

Did you try verbose debugging? I wonder if there is any relation
between Out of order and:
if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
}

Could you try it?

--
Rafał

2011-06-01 12:25:24

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/6/1 Chris Vine <[email protected]>:
> On Wed, 1 Jun 2011 12:56:22 +0200
> Rafał Miłecki <[email protected]> wrote:
>> 2011/6/1 Chris Vine <[email protected]>:
>> > On Wed, 1 Jun 2011 11:23:45 +0200
>> > Rafał Miłecki <[email protected]> wrote:
>> >> AFAIK to enable this debugging you only need to:
>> >> echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose
>> >
>> > I don't have a /sys/kernel/debug directory with a running 3.0.0-rc1
>> > kernel, so it appears that b43 debugging (which I do have enabled)
>> > doesn't use it.
>>
>> mount -t debugfs none /sys/kernel/debug/
>
> Ah, so I need to compile in debugfs.  That isn't/wasn't necessary for
> the DMA error debugging.  I am surprised that debugfs enables you to
> alter kernel debugging levels on the fly (I thought it was a
> passive logging mechanism for kernel state), but you live and learn.

Removing following if () condition:
if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
will work fine as well ;)
You can just leave:
b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
without condition around it (and make sure to enable B43 debugging).


> I will compile in debugfs, but I don't expect to have any rapid results
> for you. With 3 and a half hours of streaming yesterday it happened
> once.  I won't be able to do much testing by way of transferring files
> over the LAN for a while either (I imagine that would provide greater
> stress testing).

I think you should easily get this error by transmitting. Streaming
some video is mostly receiving. Just putting some random
(ftp/sftp/iperf) server in the network would make the trick.

--
Rafał

2011-06-01 09:23:46

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

W dniu 1 czerwca 2011 01:29 użytkownik Chris Vine
<[email protected]> napisał:
> However, fixing this bug reveals something else.  After about 2 hours of
> streaming on MLB.TV I got this, apparently at random (but this is not a
> hard error: by disassociating from the AP and reassociating, the rest of
> the game played fine, and there was no need to load or unload modules):
>
> b43-phy0 debug: TX-status contains invalid cookie: 0x0DCE
> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76

Chris, my first suspicion for this issue is filling the whole TX
buffer. This could explain what does it happen on high traffic.

Below there is an idea how we can easily test it. Can you take a look
at this, and perform some testing?


2011/5/10 Rafał Miłecki <[email protected]>:
> It seems to happen only under heavy load... maybe it has something to
> do with reaching end of the ring (not whole ring, just end of free
> slots)?
>
> Did you try verbose debugging? I wonder if there is any relation
> between Out of order and:
> if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
>        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
> }
>
> Could you try it?

AFAIK to enable this debugging you only need to:
echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose

--
Rafał

2011-06-01 10:48:58

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 11:23:45 +0200
Rafał Miłecki <[email protected]> wrote:
> W dniu 1 czerwca 2011 01:29 użytkownik Chris Vine
> <[email protected]> napisał:
> > However, fixing this bug reveals something else.  After about 2
> > hours of streaming on MLB.TV I got this, apparently at random (but
> > this is not a hard error: by disassociating from the AP and
> > reassociating, the rest of the game played fine, and there was no
> > need to load or unload modules):
> >
> > b43-phy0 debug: TX-status contains invalid cookie: 0x0DCE
> > b43-phy0 debug: Out of order TX status report on DMA ring 1.
> > Expected 74, but got 76
>
> Chris, my first suspicion for this issue is filling the whole TX
> buffer. This could explain what does it happen on high traffic.
>
> Below there is an idea how we can easily test it. Can you take a look
> at this, and perform some testing?
>
>
> 2011/5/10 Rafał Miłecki <[email protected]>:
> > It seems to happen only under heavy load... maybe it has something
> > to do with reaching end of the ring (not whole ring, just end of
> > free slots)?
> >
> > Did you try verbose debugging? I wonder if there is any relation
> > between Out of order and:
> > if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
> >        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
> > }
> >
> > Could you try it?
>
> AFAIK to enable this debugging you only need to:
> echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose

I don't have a /sys/kernel/debug directory with a running 3.0.0-rc1
kernel, so it appears that b43 debugging (which I do have enabled)
doesn't use it.

On the question of whether it is a firmware problem, I have not
experienced a similar issue with the wl driver, but the firmware that
b43 uses from broadcom-wl-4.178.10.4 seems to be an earlier version, so
that does not rule out the firmware: the earliest hybrid-portsrc wl
driver I have used is v5.10.91.9.3.

Chris



2011-06-01 10:56:23

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/6/1 Chris Vine <[email protected]>:
> On Wed, 1 Jun 2011 11:23:45 +0200
> Rafał Miłecki <[email protected]> wrote:
>> W dniu 1 czerwca 2011 01:29 użytkownik Chris Vine
>> <[email protected]> napisał:
>> > However, fixing this bug reveals something else.  After about 2
>> > hours of streaming on MLB.TV I got this, apparently at random (but
>> > this is not a hard error: by disassociating from the AP and
>> > reassociating, the rest of the game played fine, and there was no
>> > need to load or unload modules):
>> >
>> > b43-phy0 debug: TX-status contains invalid cookie: 0x0DCE
>> > b43-phy0 debug: Out of order TX status report on DMA ring 1.
>> > Expected 74, but got 76
>>
>> Chris, my first suspicion for this issue is filling the whole TX
>> buffer. This could explain what does it happen on high traffic.
>>
>> Below there is an idea how we can easily test it. Can you take a look
>> at this, and perform some testing?
>>
>>
>> 2011/5/10 Rafał Miłecki <[email protected]>:
>> > It seems to happen only under heavy load... maybe it has something
>> > to do with reaching end of the ring (not whole ring, just end of
>> > free slots)?
>> >
>> > Did you try verbose debugging? I wonder if there is any relation
>> > between Out of order and:
>> > if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
>> >        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
>> > }
>> >
>> > Could you try it?
>>
>> AFAIK to enable this debugging you only need to:
>> echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose
>
> I don't have a /sys/kernel/debug directory with a running 3.0.0-rc1
> kernel, so it appears that b43 debugging (which I do have enabled)
> doesn't use it.

mount -t debugfs none /sys/kernel/debug/


> On the question of whether it is a firmware problem, I have not
> experienced a similar issue with the wl driver, but the firmware that
> b43 uses from broadcom-wl-4.178.10.4 seems to be an earlier version, so
> that does not rule out the firmware: the earliest hybrid-portsrc wl
> driver I have used is v5.10.91.9.3.

--
Rafał

2011-06-01 18:09:47

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 20:03:56 +0200
Rafał Miłecki <[email protected]> wrote:
> 2011/6/1 Chris Vine <[email protected]>:
> > The one which failed caused the cessation of
> > wireless traffic, and was accompanied by the debug log reports of
> > out of order TX status earlier referred to, and with only one
> > single report
>
> As I said, I don't really suspect any relation between full TX ring
> and out of order problems anymore.
>
> But it won't hurt to ask just in case:
> Did you see relation between this single "Stopped TX ring 1" and
> massive out of order messages? Did out of order messages starter right
> after "Stopped TX ring 1" one?

The "Stopped TX ring 1" message was near the end (but not at the end)
of the out of order messages. It was definitely not at the beginning of
the out of order messages.

Chris



2011-06-01 15:01:20

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 14:25:23 +0200
Rafał Miłecki <[email protected]> wrote:
> I think you should easily get this error by transmitting. Streaming
> some video is mostly receiving. Just putting some random
> (ftp/sftp/iperf) server in the network would make the trick.

OK rather than recompile the kernel with debugfs enabled, as you
suggested I took the debugging call to b43dbg() out of the
B43_DBG_DMAVERBOSE if block (so it is entered whether or not verbose
debugging is set).

I transferred a 268MB file across the LAN in a little over 5 minutes
(so the transfer speed was a little under 10Mb/s). During the course
of the transfer I got about 500 "b43-phy0 debug: Stopped TX ring 1"
statements logged.

However the interesting thing is that with this debugging statement
included, I got no messages about any out of order TX status; instead,
apart from the overrun messages to the debug log, the link performed
normally. The file transfer did not fail (I have checked that the file
received is identical to the file sent) nor was the link to the router
lost. Probably writing to the debug log has changed some timing race
somewhere to the benefit of link integrity.

However, as I said, I am not going to be in a position to do much
testing by way of transferring further files over the LAN for a period
of time, for unrelated reasons.

Chris



2011-06-01 09:24:22

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

W dniu 1 czerwca 2011 01:29 użytkownik Chris Vine
<[email protected]> napisał:
> However, fixing this bug reveals something else.  After about 2 hours of
> streaming on MLB.TV I got this, apparently at random (but this is not a
> hard error: by disassociating from the AP and reassociating, the rest of
> the game played fine, and there was no need to load or unload modules):
>
> b43-phy0 debug: TX-status contains invalid cookie: 0x0DCE
> b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76

Chris, my first suspicion for this issue is filling the whole TX
buffer. This could explain what does it happen on high traffic.

Below there is an idea how we can easily test it. Can you take a look
at this, and perform some testing?


2011/5/10 Rafał Miłecki <[email protected]>:
> It seems to happen only under heavy load... maybe it has something to
> do with reaching end of the ring (not whole ring, just end of free
> slots)?
>
> Did you try verbose debugging? I wonder if there is any relation
> between Out of order and:
> if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
>        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
> }
>
> Could you try it?

AFAIK to enable this debugging you only need to:
echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose

--
Rafał

2011-06-01 17:49:49

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 17:42:55 +0200
Rafał Miłecki <[email protected]> wrote:
> 2011/6/1 Chris Vine <[email protected]>:
> > On Wed, 1 Jun 2011 14:25:23 +0200
> > Rafał Miłecki <[email protected]> wrote:
> >> I think you should easily get this error by transmitting. Streaming
> >> some video is mostly receiving. Just putting some random
> >> (ftp/sftp/iperf) server in the network would make the trick.
> >
> > OK rather than recompile the kernel with debugfs enabled, as you
> > suggested I took the debugging call to b43dbg() out of the
> > B43_DBG_DMAVERBOSE if block (so it is entered whether or not verbose
> > debugging is set).
> >
> > I transferred a 268MB file across the LAN in a little over 5 minutes
> > (so the transfer speed was a little under 10Mb/s).  During the
> > course of the transfer I got about 500 "b43-phy0 debug: Stopped TX
> > ring 1" statements logged.
> >
> > However the interesting thing is that with this debugging statement
> > included, I got no messages about any out of order TX status;
> > instead, apart from the overrun messages to the debug log, the link
> > performed normally. The file transfer did not fail (I have checked
> > that the file received is identical to the file sent) nor was the
> > link to the router lost. Probably writing to the debug log has
> > changed some timing race somewhere to the benefit of link integrity.
> >
> > However, as I said, I am not going to be in a position to do much
> > testing by way of transferring further files over the LAN for a
> > period of time, for unrelated reasons.
>
> Well, it just seems that hitting full TX ring does not cause firmware
> problems and out of order issue. However I've no idea what else can
> cause it. We were also seeing this issue with free firmware on G-PHY
> cards.
>
> Maybe this is some hardware issue firmware has to workaround? Maybe
> updating firmware could help? My next idea is to try 508.1107
> firmware.

My earlier report didn't test in both directions (it was for a netbook
to desktop transfer). I have now made a number of transfers in both
directions of a 268MB file using sftp, and the results are below.
Both ends have sshd installed and running, and it is the sending machine
whose ssh daemon is active for the transfer in question: in other
words, all the transfers are get rather than put operations.

My netbook is the computer with the broadcom wireless device. My
desktop doesn't use wireless: it is connected to the router via 100Mb/s
ethernet. Therefore, the transfer speeds are limited by the wireless
link to the netbook rather than the ethernet link to the desktop.

The transfers from the netbook to the desktop computers took an average
of 4 mins 28 secs, and on each occasion the file transfer completed
successfully and the wireless link stayed up, although I got the
repeated reports of "b43-phy0 debug: Stopped TX ring 1" to which I have
earlier referred.

The transfers from the desktop to the netbook, when they succeeded, were
faster, taking an average of 2 mins 35 secs (this is surprisingly
quick for a 268MB file). Of the three transfer attempts I made, two
succeeded, with no error messages or any kind reported to the debug
log, and one failed. The one which failed caused the cessation of
wireless traffic, and was accompanied by the debug log reports of out
of order TX status earlier referred to, and with only one single report
in the debug log of "Stopped TX ring 1". In the case of the failed
transfer, I brought the wireless link back up by disassociating and
then reassociating with the AP/router. It was not necessary to unload
and reload the b43 module, so there was no hard error involved.

Summary: Traffic sent up from the broadcom wireless device generates
copious reports of "Stopped TX ring 1" but always carries on with its
job and stays up, although its traffic is slower than on received
packets. Received traffic on the other hand reports no errors until
the spate of "Out of order TX status report on DMA ring 1" errors
occurs, which seems to happen at random (albeit accompanied on my
failed transfer by a single "Stopped TX ring 1" log entry), and when it
does happen brings the wireless link to a halt. Wireless traffic can be
restarted simply by reassociating with the AP.

With that, I am afraid that really is it for a few days.

Chris




2011-06-01 15:42:57

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/6/1 Chris Vine <[email protected]>:
> On Wed, 1 Jun 2011 14:25:23 +0200
> Rafał Miłecki <[email protected]> wrote:
>> I think you should easily get this error by transmitting. Streaming
>> some video is mostly receiving. Just putting some random
>> (ftp/sftp/iperf) server in the network would make the trick.
>
> OK rather than recompile the kernel with debugfs enabled, as you
> suggested I took the debugging call to b43dbg() out of the
> B43_DBG_DMAVERBOSE if block (so it is entered whether or not verbose
> debugging is set).
>
> I transferred a 268MB file across the LAN in a little over 5 minutes
> (so the transfer speed was a little under 10Mb/s).  During the course
> of the transfer I got about 500 "b43-phy0 debug: Stopped TX ring 1"
> statements logged.
>
> However the interesting thing is that with this debugging statement
> included, I got no messages about any out of order TX status; instead,
> apart from the overrun messages to the debug log, the link performed
> normally. The file transfer did not fail (I have checked that the file
> received is identical to the file sent) nor was the link to the router
> lost. Probably writing to the debug log has changed some timing race
> somewhere to the benefit of link integrity.
>
> However, as I said, I am not going to be in a position to do much
> testing by way of transferring further files over the LAN for a period
> of time, for unrelated reasons.

Well, it just seems that hitting full TX ring does not cause firmware
problems and out of order issue. However I've no idea what else can
cause it. We were also seeing this issue with free firmware on G-PHY
cards.

Maybe this is some hardware issue firmware has to workaround? Maybe
updating firmware could help? My next idea is to try 508.1107
firmware.

--
Rafał

2011-06-01 18:03:57

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/6/1 Chris Vine <[email protected]>:
> The one which failed caused the cessation of
> wireless traffic, and was accompanied by the debug log reports of out
> of order TX status earlier referred to, and with only one single report

As I said, I don't really suspect any relation between full TX ring
and out of order problems anymore.

But it won't hurt to ask just in case:
Did you see relation between this single "Stopped TX ring 1" and
massive out of order messages? Did out of order messages starter right
after "Stopped TX ring 1" one?

--
Rafał

2011-06-01 18:10:45

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/6/1 Chris Vine <[email protected]>:
> On Wed, 1 Jun 2011 20:03:56 +0200
> Rafał Miłecki <[email protected]> wrote:
>> 2011/6/1 Chris Vine <[email protected]>:
>> > The one which failed caused the cessation of
>> > wireless traffic, and was accompanied by the debug log reports of
>> > out of order TX status earlier referred to, and with only one
>> > single report
>>
>> As I said, I don't really suspect any relation between full TX ring
>> and out of order problems anymore.
>>
>> But it won't hurt to ask just in case:
>> Did you see relation between this single "Stopped TX ring 1" and
>> massive out of order messages? Did out of order messages starter right
>> after "Stopped TX ring 1" one?
>
> The "Stopped TX ring 1" message was near the end (but not at the end)
> of the out of order messages. It was definitely not at the beginning of
> the out of order messages.

Thanks. That confirmed for me, there is no relation.

--
Rafał

2011-06-01 12:08:58

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 12:56:22 +0200
Rafał Miłecki <[email protected]> wrote:
> 2011/6/1 Chris Vine <[email protected]>:
> > On Wed, 1 Jun 2011 11:23:45 +0200
> > Rafał Miłecki <[email protected]> wrote:
> >> AFAIK to enable this debugging you only need to:
> >> echo 1 > /sys/kernel/debug/b43/phy0/debug_dmaverbose
> >
> > I don't have a /sys/kernel/debug directory with a running 3.0.0-rc1
> > kernel, so it appears that b43 debugging (which I do have enabled)
> > doesn't use it.
>
> mount -t debugfs none /sys/kernel/debug/

Ah, so I need to compile in debugfs. That isn't/wasn't necessary for
the DMA error debugging. I am surprised that debugfs enables you to
alter kernel debugging levels on the fly (I thought it was a
passive logging mechanism for kernel state), but you live and learn.

I will compile in debugfs, but I don't expect to have any rapid results
for you. With 3 and a half hours of streaming yesterday it happened
once. I won't be able to do much testing by way of transferring files
over the LAN for a while either (I imagine that would provide greater
stress testing).

Chris



2011-07-26 09:28:51

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Wed, 1 Jun 2011 20:10:44 +0200
Rafał Miłecki <[email protected]> wrote:
> 2011/6/1 Chris Vine <[email protected]>:
> > On Wed, 1 Jun 2011 20:03:56 +0200
> > Rafał Miłecki <[email protected]> wrote:
> >> 2011/6/1 Chris Vine <[email protected]>:
> >> > The one which failed caused the cessation of
> >> > wireless traffic, and was accompanied by the debug log reports of
> >> > out of order TX status earlier referred to, and with only one
> >> > single report
> >>
> >> As I said, I don't really suspect any relation between full TX ring
> >> and out of order problems anymore.
> >>
> >> But it won't hurt to ask just in case:
> >> Did you see relation between this single "Stopped TX ring 1" and
> >> massive out of order messages? Did out of order messages starter
> >> right after "Stopped TX ring 1" one?
> >
> > The "Stopped TX ring 1" message was near the end (but not at the
> > end) of the out of order messages. It was definitely not at the
> > beginning of the out of order messages.
>
> Thanks. That confirmed for me, there is no relation.

I have tested with the newer firmware (from broadcom-wl-5.10.56.27.3),
and I get the same failure with "Out of order TX status report on DMA
ring 1" after transferring about 380MB of data at full speed.

Chris



2011-08-14 21:19:58

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Sun, 14 Aug 2011 22:54:20 +0200
Rafał Miłecki <[email protected]> wrote:
> I didn't expect improvement, I tried to make it worse :P I though our
> problems may be coming from not-fast-enough reading RX ring. So you
> just made that ring even smaller to allow faster/easier errors.
>
> What do you mean by different errors? The only one I can see is
> "b43-phy0 ERROR: PHY transmission error". Do you mean that or
> something else? Did you post all the log you wanted to? Maybe missed
> beginning?

That's it.

On your question "what do you mean by different errors", I mean the
reported errors are different. Without that patch the reported error
is:

b43-phy0 debug: TX-status contains invalid cookie: 0x0DCE
b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 76
b43-phy0 debug: Out of order TX status report on DMA ring 1. Expected 74, but got 78
... and so on.

Chris



2011-08-14 11:10:40

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

Hey Chris,

2011/6/1 Chris Vine <[email protected]>:
> Summary: Traffic sent up from the broadcom wireless device generates
> copious reports of "Stopped TX ring 1" but always carries on with its
> job and stays up, although its traffic is slower than on received
> packets.  Received traffic on the other hand reports no errors until
> the spate of "Out of order TX status report on DMA ring 1" errors
> occurs, which seems to happen at random (albeit accompanied on my
> failed transfer by a single "Stopped TX ring 1" log entry), and when it
> does happen brings the wireless link to a halt. Wireless traffic can be
> restarted simply by reassociating with the AP.

So it seems heavy load of RX can cause problems with TX. I just though
of one another reason, could you test one more thing for me?

Edit drivers/net/wireless/b43/dma.h and change define from value 64 to
in the line:
#define B43_RXRING_SLOTS 64

Can you get "Out of order TX" much more easily after such a change?

--
Rafał

2011-08-14 11:11:12

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/8/14 Rafał Miłecki <[email protected]>:
> Hey Chris,
>
> 2011/6/1 Chris Vine <[email protected]>:
>> Summary: Traffic sent up from the broadcom wireless device generates
>> copious reports of "Stopped TX ring 1" but always carries on with its
>> job and stays up, although its traffic is slower than on received
>> packets.  Received traffic on the other hand reports no errors until
>> the spate of "Out of order TX status report on DMA ring 1" errors
>> occurs, which seems to happen at random (albeit accompanied on my
>> failed transfer by a single "Stopped TX ring 1" log entry), and when it
>> does happen brings the wireless link to a halt. Wireless traffic can be
>> restarted simply by reassociating with the AP.
>
> So it seems heavy load of RX can cause problems with TX. I just though
> of one another reason, could you test one more thing for me?
>
> Edit drivers/net/wireless/b43/dma.h and change define from value 64 to
> in the line:
> #define B43_RXRING_SLOTS                64

I meant "to 8":
#define B43_RXRING_SLOTS                8

--
Rafał

2011-08-14 20:46:28

by Chris Vine

[permalink] [raw]
Subject: Re: b43 error under heavy load

On Sun, 14 Aug 2011 13:11:12 +0200
Rafał Miłecki <[email protected]> wrote:
> 2011/8/14 Rafał Miłecki <[email protected]>:
> > Hey Chris,
> >
> > 2011/6/1 Chris Vine <[email protected]>:
> >> Summary: Traffic sent up from the broadcom wireless device
> >> generates copious reports of "Stopped TX ring 1" but always
> >> carries on with its job and stays up, although its traffic is
> >> slower than on received packets.  Received traffic on the other
> >> hand reports no errors until the spate of "Out of order TX status
> >> report on DMA ring 1" errors occurs, which seems to happen at
> >> random (albeit accompanied on my failed transfer by a single
> >> "Stopped TX ring 1" log entry), and when it does happen brings the
> >> wireless link to a halt. Wireless traffic can be restarted simply
> >> by reassociating with the AP.
> >
> > So it seems heavy load of RX can cause problems with TX. I just
> > though of one another reason, could you test one more thing for me?
> >
> > Edit drivers/net/wireless/b43/dma.h and change define from value 64
> > to in the line:
> > #define B43_RXRING_SLOTS                64
>
> I meant "to 8":
> #define B43_RXRING_SLOTS                8

It is not an improvement. Different errors occur earlier (approx 100MB
of download), upon which the wireless disassociates from the network.
dmesg produces the output at the end of this post (with MAC addresses
munged).

Chris

*****

b43-phy0 ERROR: PHY transmission error
ieee80211 phy0: wlan0: No probe response from AP 00:xx:xx:xx:xx:xx after 500ms, disconnecting.
cfg80211: Calling CRDA to update world regulatory domain
cfg80211: World regulatory domain updated:
cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211: (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211: Calling CRDA for country: JP
cfg80211: Regulatory domain changed to country: JP
cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm)
cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (N/A, 2000 mBm)
cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm)
cfg80211: (4910000 KHz - 4930000 KHz @ 10000 KHz), (N/A, 2300 mBm)
cfg80211: (4910000 KHz - 4990000 KHz @ 40000 KHz), (N/A, 2300 mBm)
cfg80211: (4930000 KHz - 4950000 KHz @ 10000 KHz), (N/A, 2300 mBm)
cfg80211: (5030000 KHz - 5045000 KHz @ 10000 KHz), (N/A, 2300 mBm)
cfg80211: (5030000 KHz - 5090000 KHz @ 40000 KHz), (N/A, 2300 mBm)
cfg80211: (5050000 KHz - 5060000 KHz @ 10000 KHz), (N/A, 2300 mBm)
cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (N/A, 2000 mBm)
cfg80211: (5250000 KHz - 5330000 KHz @ 40000 KHz), (N/A, 2000 mBm)
cfg80211: (5490000 KHz - 5710000 KHz @ 40000 KHz), (N/A, 2300 mBm)
wlan0: authenticate with 00:xx:xx:xx:xx:xx (try 1)
wlan0: deauthenticating from 00:xx:xx:xx:xx:xx by local choice (reason=3)
b43-phy0 debug: Removing Interface type 2
b43-phy0 debug: Wireless interface stopped
b43-phy0 debug: DMA-64 rx_ring: Used slots 7/8, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_BK: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_BE: Used slots 102/256, Failed frames 3/84586 = 0.0%, Average tries 1.12
b43-phy0 debug: DMA-64 tx_ring_AC_VI: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_VO: Used slots 2/256, Failed frames 4/606 = 0.6%, Average tries 1.04
b43-phy0 debug: DMA-64 tx_ring_mcast: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0: Loading firmware version 508.1084 (2009-01-14 01:32:01)
b43-phy0 debug: b2062: Using crystal tab entry 19200 kHz.
b43-phy0 debug: Chip initialized
b43-phy0 debug: 64-bit DMA initialized
b43-phy0 debug: QoS enabled
b43-phy0 debug: Wireless interface started
b43-phy0 debug: Adding Interface type 2
ADDRCONF(NETDEV_UP): wlan0: link is not ready
wlan0: authenticate with 00:xx:xx:xx:xx:xx (try 1)
wlan0: authenticated
wlan0: associate with 00:xx:xx:xx:xx:xx (try 1)
wlan0: RX AssocResp from 00:xx:xx:xx:xx:xx (capab=0x411 status=12 aid=0)
wlan0: 00:xx:xx:xx:xx:xx denied association (code=12)
wlan0: deauthenticating from 00:xx:xx:xx:xx:xx by local choice (reason=3)
wlan0: deauthenticating from 00:xx:xx:xx:xx:xx by local choice (reason=3)
b43-phy0 debug: Removing Interface type 2
b43-phy0 debug: Wireless interface stopped
b43-phy0 debug: DMA-64 rx_ring: Used slots 1/8, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_BK: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_BE: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_VI: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
b43-phy0 debug: DMA-64 tx_ring_AC_VO: Used slots 2/256, Failed frames 0/4 = 0.0%, Average tries 1.00
b43-phy0 debug: DMA-64 tx_ring_mcast: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00



2011-08-14 20:54:20

by Rafał Miłecki

[permalink] [raw]
Subject: Re: b43 error under heavy load

2011/8/14 Chris Vine <[email protected]>:
> On Sun, 14 Aug 2011 13:11:12 +0200
> Rafał Miłecki <[email protected]> wrote:
>> 2011/8/14 Rafał Miłecki <[email protected]>:
>> > Hey Chris,
>> >
>> > 2011/6/1 Chris Vine <[email protected]>:
>> >> Summary: Traffic sent up from the broadcom wireless device
>> >> generates copious reports of "Stopped TX ring 1" but always
>> >> carries on with its job and stays up, although its traffic is
>> >> slower than on received packets.  Received traffic on the other
>> >> hand reports no errors until the spate of "Out of order TX status
>> >> report on DMA ring 1" errors occurs, which seems to happen at
>> >> random (albeit accompanied on my failed transfer by a single
>> >> "Stopped TX ring 1" log entry), and when it does happen brings the
>> >> wireless link to a halt. Wireless traffic can be restarted simply
>> >> by reassociating with the AP.
>> >
>> > So it seems heavy load of RX can cause problems with TX. I just
>> > though of one another reason, could you test one more thing for me?
>> >
>> > Edit drivers/net/wireless/b43/dma.h and change define from value 64
>> > to in the line:
>> > #define B43_RXRING_SLOTS                64
>>
>> I meant "to 8":
>> #define B43_RXRING_SLOTS                8
>
> It is not an improvement.  Different errors occur earlier (approx 100MB
> of download), upon which the wireless disassociates from the network.
> dmesg produces the output at the end of this post (with MAC addresses
> munged).

I didn't expect improvement, I tried to make it worse :P I though our
problems may be coming from not-fast-enough reading RX ring. So you
just made that ring even smaller to allow faster/easier errors.

What do you mean by different errors? The only one I can see is
"b43-phy0 ERROR: PHY transmission error". Do you mean that or
something else? Did you post all the log you wanted to? Maybe missed
beginning?

--
Rafał