LinuxLists.cc - More data on open-source firmware crash

2009-02-22 19:10:44

Subject: More data on open-source firmware crash

Francesco and Lorenzo,

I modified my driver source to dump the firmware machine state whenever the
b43_dma_handle_txstatus routine was called with an out-of-order cookie. With
proprietary firmware, the test of a flood ping in one job and repeated tcpperf
transmissions in a second ran for 10 hours without a single "failure". With the
open-source firmware it failed after about 2 hours.

Below are the saved status data. Listed for each item are the cookie, the
sequence number, and the skb length. The 0x84 length values come from the ping.
All of the out-of-order items come from tcpperf - is it significant that they
are from the longer set? Note that a number of cookie/sequence pairs are
missing, namely: 2064/9C1, 2066/9C2, 2068/9C3, 206A/9C4, 206C/9C5, 2072/9C7,
2076/9C9, and 207A/9CB. Cookie 206E is missing, but the next sequence (9C6) was
attached to cookie 2070.

This was not the first printout, but at this point cookie/sequence pair 2086/9D2
was received. It is a duplicate of item 22, thus its skb had been deleted and
poisoned.

I don't understand the firmware, but is it possible that there is a queue
overrun, or some data in a queue are being missed?

Feb 22 00:07:59 mtech kernel: Item 28: 0x205A 0x9BC 0x84
Feb 22 00:07:59 mtech kernel: Item 27: 0x205C 0x9BD 0x84
Feb 22 00:07:59 mtech kernel: Item 26: 0x205E 0x9BE 0x84
Feb 22 00:07:59 mtech kernel: Item 25: 0x2060 0x9BF 0x84
Feb 22 00:07:59 mtech kernel: Item 24: 0x2062 0x9C0 0x84
Feb 22 00:07:59 mtech kernel: Item 23: 0x2084 0x9D1 0x60C
Feb 22 00:07:59 mtech kernel: Item 22: 0x2086 0x9D2 0x60C
Feb 22 00:07:59 mtech kernel: Item 21: 0x2088 0x9D3 0x60C
Feb 22 00:07:59 mtech kernel: Item 20: 0x208A 0x9D4 0x60C
Feb 22 00:07:59 mtech kernel: Item 19: 0x208C 0x9D5 0x60C
Feb 22 00:07:59 mtech kernel: Item 18: 0x208E 0x9D6 0x60C
Feb 22 00:07:59 mtech kernel: Item 17: 0x2070 0x9C6 0x60C
Feb 22 00:07:59 mtech kernel: Item 16: 0x2090 0x9D7 0x60C
Feb 22 00:07:59 mtech kernel: Item 15: 0x2092 0x9D8 0x60C
Feb 22 00:07:59 mtech kernel: Item 14: 0x2074 0x9C8 0x60C
Feb 22 00:07:59 mtech kernel: Item 13: 0x2094 0x9D9 0x60C
Feb 22 00:07:59 mtech kernel: Item 12: 0x2096 0x9DA 0x60C
Feb 22 00:07:59 mtech kernel: Item 11: 0x2078 0x9CA 0x60C
Feb 22 00:07:59 mtech kernel: Item 10: 0x2098 0x9DB 0x60C
Feb 22 00:07:59 mtech kernel: Item 9: 0x209A 0x9DC 0x1B4
Feb 22 00:07:59 mtech kernel: Item 8: 0x207C 0x9CC 0x60C
Feb 22 00:07:59 mtech kernel: Item 7: 0x209C 0x9DD 0x60C
Feb 22 00:07:59 mtech kernel: Item 6: 0x209E 0x9DE 0x84
Feb 22 00:07:59 mtech kernel: Item 5: 0x2080 0x9CE 0x60C
Feb 22 00:07:59 mtech kernel: Item 4: 0x20A0 0x9DF 0x84
Feb 22 00:07:59 mtech kernel: Item 3: 0x2082 0x9D0 0x60C
Feb 22 00:07:59 mtech kernel: Item 2: 0x20A2 0x9E0 0x84
Feb 22 00:07:59 mtech kernel: Item 1: 0x20A4 0x9E1 0x84

Larry

2009-02-22 19:47:27

by Francesco Gringoli

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

On Feb 22, 2009, at 8:10 PM, Larry Finger wrote:

> Francesco and Lorenzo,
>
Hi Larry,

many thanks for your analysis.

> I modified my driver source to dump the firmware machine state
> whenever the
> b43_dma_handle_txstatus routine was called with an out-of-order
> cookie. With
> proprietary firmware, the test of a flood ping in one job and
> repeated tcpperf
> transmissions in a second ran for 10 hours without a single
> "failure". With the
> open-source firmware it failed after about 2 hours.
>
> Below are the saved status data. Listed for each item are the
> cookie, the
> sequence number, and the skb length. The 0x84 length values come
> from the ping.
> All of the out-of-order items come from tcpperf - is it significant
> that they
> are from the longer set? Note that a number of cookie/sequence pairs
> are
> missing, namely: 2064/9C1, 2066/9C2, 2068/9C3, 206A/9C4, 206C/9C5,
> 2072/9C7,
> 2076/9C9, and 207A/9CB. Cookie 206E is missing, but the next
> sequence (9C6) was
> attached to cookie 2070.
>
> This was not the first printout, but at this point cookie/sequence
> pair 2086/9D2
> was received. It is a duplicate of item 22, thus its skb had been
> deleted and
> poisoned.
>
> I don't understand the firmware, but is it possible that there is a
> queue
> overrun, or some data in a queue are being missed?
That could be. After reading this, I remember that there was one
condition (always false on all of our boards) that we can't understand
at that time. I'm preparing a patch: I hope you can test it as we do
not have boards that will jump on that condition.

Cheers,
-FG

>
> Feb 22 00:07:59 mtech kernel: Item 28: 0x205A 0x9BC 0x84
> Feb 22 00:07:59 mtech kernel: Item 27: 0x205C 0x9BD 0x84
> Feb 22 00:07:59 mtech kernel: Item 26: 0x205E 0x9BE 0x84
> Feb 22 00:07:59 mtech kernel: Item 25: 0x2060 0x9BF 0x84
> Feb 22 00:07:59 mtech kernel: Item 24: 0x2062 0x9C0 0x84
> Feb 22 00:07:59 mtech kernel: Item 23: 0x2084 0x9D1 0x60C
> Feb 22 00:07:59 mtech kernel: Item 22: 0x2086 0x9D2 0x60C
> Feb 22 00:07:59 mtech kernel: Item 21: 0x2088 0x9D3 0x60C
> Feb 22 00:07:59 mtech kernel: Item 20: 0x208A 0x9D4 0x60C
> Feb 22 00:07:59 mtech kernel: Item 19: 0x208C 0x9D5 0x60C
> Feb 22 00:07:59 mtech kernel: Item 18: 0x208E 0x9D6 0x60C
> Feb 22 00:07:59 mtech kernel: Item 17: 0x2070 0x9C6 0x60C
> Feb 22 00:07:59 mtech kernel: Item 16: 0x2090 0x9D7 0x60C
> Feb 22 00:07:59 mtech kernel: Item 15: 0x2092 0x9D8 0x60C
> Feb 22 00:07:59 mtech kernel: Item 14: 0x2074 0x9C8 0x60C
> Feb 22 00:07:59 mtech kernel: Item 13: 0x2094 0x9D9 0x60C
> Feb 22 00:07:59 mtech kernel: Item 12: 0x2096 0x9DA 0x60C
> Feb 22 00:07:59 mtech kernel: Item 11: 0x2078 0x9CA 0x60C
> Feb 22 00:07:59 mtech kernel: Item 10: 0x2098 0x9DB 0x60C
> Feb 22 00:07:59 mtech kernel: Item 9: 0x209A 0x9DC 0x1B4
> Feb 22 00:07:59 mtech kernel: Item 8: 0x207C 0x9CC 0x60C
> Feb 22 00:07:59 mtech kernel: Item 7: 0x209C 0x9DD 0x60C
> Feb 22 00:07:59 mtech kernel: Item 6: 0x209E 0x9DE 0x84
> Feb 22 00:07:59 mtech kernel: Item 5: 0x2080 0x9CE 0x60C
> Feb 22 00:07:59 mtech kernel: Item 4: 0x20A0 0x9DF 0x84
> Feb 22 00:07:59 mtech kernel: Item 3: 0x2082 0x9D0 0x60C
> Feb 22 00:07:59 mtech kernel: Item 2: 0x20A2 0x9E0 0x84
> Feb 22 00:07:59 mtech kernel: Item 1: 0x20A4 0x9E1 0x84
>
> Larry
Larry, it seems that

-------

Francesco Gringoli, PhD - Assistant Professor
Dept. of Electrical Engineering for Automation
University of Brescia
via Branze, 38
25123 Brescia
ITALY

Ph: ++39.030.3715843
FAX: ++39.030.380014
WWW: http://www.ing.unibs.it/~gringoli

2009-02-24 16:24:07

by Larry Finger

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

Francesco,

I don't think the problem results from an overrun of the TX FIFO. In my latest
test, I added code in routine handle_irq_transmit_status() that would print a
message if the while loop count reached 16 or more. That print did not occur.
The largest queue depth reached during the run was 14. When the first
out-of-order cookie was received, the largest depth in the last 64 transactions
was only 5.

I'm going to increase the number of transactions saved for my next run. Any
other info that I should capture?

Larry

2009-02-24 16:29:33

by Francesco Gringoli

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

On Feb 24, 2009, at 5:23 PM, Larry Finger wrote:

>
> Francesco,
>
> I don't think the problem results from an overrun of the TX FIFO. In
> my latest
> test, I added code in routine handle_irq_transmit_status() that
> would print a
> message if the while loop count reached 16 or more. That print did
> not occur.
> The largest queue depth reached during the run was 14. When the first
> out-of-order cookie was received, the largest depth in the last 64
> transactions
> was only 5.
>
> I'm going to increase the number of transactions saved for my next
> run. Any
> other info that I should capture?
What about drop info from dmesg? Another try could be halve the dma
queue, I believe now it is 64 slots. Could you give a try setting it
to 32?

Cheers,
-FG

>
>
> Larry

-------

Francesco Gringoli, PhD - Assistant Professor
Dept. of Electrical Engineering for Automation
University of Brescia
via Branze, 38
25123 Brescia
ITALY

Ph: ++39.030.3715843
FAX: ++39.030.380014
WWW: http://www.ing.unibs.it/~gringoli

2009-02-24 16:36:00

by Larry Finger

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

Francesco Gringoli wrote:
> What about drop info from dmesg? Another try could be halve the dma
> queue, I believe now it is 64 slots. Could you give a try setting it to 32?

The RX DMA queue was 64 and TX was 256. Both will be 32 in my next run.

I don't understand the "drop info from dmesg" part.

Larry

2009-02-24 18:50:19

by Francesco Gringoli

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

On Feb 24, 2009, at 7:42 PM, Larry Finger wrote:

> Francesco Gringoli wrote:
>>
>> On Feb 24, 2009, at 5:35 PM, Larry Finger wrote:
>>
>>> Francesco Gringoli wrote:
>>>> What about drop info from dmesg? Another try could be halve the dma
>>>> queue, I believe now it is 64 slots. Could you give a try setting
>>>> it
>>>> to 32?
>>>
>>> The RX DMA queue was 64 and TX was 256. Both will be 32 in my next
>>> run.
>>>
>>> I don't understand the "drop info from dmesg" part.
>> Sorry Larry,
>>
>> sometimes when I modify something, I can read
>>
>> [334415.451028] b43-phy244 debug: RX: Packet dropped
>>
>> in the dmesg because the received bytes were not correctly copied
>> between the "air-buffer" to the internal buffer and finally to the
>> rx-fifo along with the rxheader.
>>
>> Did you notice something similar?
>
> I hadn't noticed them before, but I do see one of these messages
> just before
> some of my out-of-order cookies. There is no associated diagnostic
> message, thus
> it must come from the status.rate_index test. I'll see if I can get
> some info
> from there.
>
> Larry
>
Larry, take care that they are not due to massive usage of b43-fwdump.
Sometime this utility forces the firmware to slow down and produce
such messages. If you didn't use the utility and got that message
anyway, we can try to investigate on this.

Cheers,
-FG

-------

Francesco Gringoli, PhD - Assistant Professor
Dept. of Electrical Engineering for Automation
University of Brescia
via Branze, 38
25123 Brescia
ITALY

Ph: ++39.030.3715843
FAX: ++39.030.380014
WWW: http://www.ing.unibs.it/~gringoli

2009-02-24 16:44:07

by Francesco Gringoli

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

On Feb 24, 2009, at 5:35 PM, Larry Finger wrote:

> Francesco Gringoli wrote:
>> What about drop info from dmesg? Another try could be halve the dma
>> queue, I believe now it is 64 slots. Could you give a try setting
>> it to 32?
>
> The RX DMA queue was 64 and TX was 256. Both will be 32 in my next
> run.
>
> I don't understand the "drop info from dmesg" part.
Sorry Larry,

sometimes when I modify something, I can read

[334415.451028] b43-phy244 debug: RX: Packet dropped

in the dmesg because the received bytes were not correctly copied
between the "air-buffer" to the internal buffer and finally to the rx-
fifo along with the rxheader.

Did you notice something similar?

Cheers,
-FG

>
>
> Larry
>

-------

Francesco Gringoli, PhD - Assistant Professor
Dept. of Electrical Engineering for Automation
University of Brescia
via Branze, 38
25123 Brescia
ITALY

Ph: ++39.030.3715843
FAX: ++39.030.380014
WWW: http://www.ing.unibs.it/~gringoli

2009-02-24 18:42:27

by Larry Finger

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

Francesco Gringoli wrote:
>
> On Feb 24, 2009, at 5:35 PM, Larry Finger wrote:
>
>> Francesco Gringoli wrote:
>>> What about drop info from dmesg? Another try could be halve the dma
>>> queue, I believe now it is 64 slots. Could you give a try setting it
>>> to 32?
>>
>> The RX DMA queue was 64 and TX was 256. Both will be 32 in my next run.
>>
>> I don't understand the "drop info from dmesg" part.
> Sorry Larry,
>
> sometimes when I modify something, I can read
>
> [334415.451028] b43-phy244 debug: RX: Packet dropped
>
> in the dmesg because the received bytes were not correctly copied
> between the "air-buffer" to the internal buffer and finally to the
> rx-fifo along with the rxheader.
>
> Did you notice something similar?

I hadn't noticed them before, but I do see one of these messages just before
some of my out-of-order cookies. There is no associated diagnostic message, thus
it must come from the status.rate_index test. I'll see if I can get some info
from there.

Larry

2009-02-22 19:23:12

by Michael Büsch

[permalink] [raw]

Subject: Re: More data on open-source firmware crash

On Sunday 22 February 2009 20:10:41 Larry Finger wrote:
> Francesco and Lorenzo,
>
> I modified my driver source to dump the firmware machine state whenever the
> b43_dma_handle_txstatus routine was called with an out-of-order cookie. With
> proprietary firmware, the test of a flood ping in one job and repeated tcpperf
> transmissions in a second ran for 10 hours without a single "failure". With the
> open-source firmware it failed after about 2 hours.
>
> Below are the saved status data. Listed for each item are the cookie, the
> sequence number, and the skb length. The 0x84 length values come from the ping.
> All of the out-of-order items come from tcpperf - is it significant that they
> are from the longer set? Note that a number of cookie/sequence pairs are
> missing, namely: 2064/9C1, 2066/9C2, 2068/9C3, 206A/9C4, 206C/9C5, 2072/9C7,
> 2076/9C9, and 207A/9CB. Cookie 206E is missing, but the next sequence (9C6) was
> attached to cookie 2070.
>
> This was not the first printout, but at this point cookie/sequence pair 2086/9D2
> was received. It is a duplicate of item 22, thus its skb had been deleted and
> poisoned.
>
> I don't understand the firmware, but is it possible that there is a queue
> overrun, or some data in a queue are being missed?

Of course this is possible, but I don't know how to verify this.

Maybe you should modify the tx status fetching loop.
I think (this is only an estimation) that the queue is about 16 entries long.
So if you are able to fetch 16 entries in a row from it, it's possible that we had
and overflow, if the firmware overflow protection mechanism failed at that point.

So you can see if the 16-entries-in-a-row and the out-of-order cookies happen at
about the same time.

Of course I don't know if the number 16 is correct. It's just an estimation.

--
Greetings, Michael.