2009-10-11 14:29:36

by Quintin Pitts

[permalink] [raw]
Subject: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

Hi,

Sorry for my lack of experience in all aspects - first time
submitting!!!

In trying to get p54pci driver to be stable on my platform and hardware
- here is a generic patch that seems to accomplish that. Since the
ViewSonic V210 uses the IT8152 pci bridge - some attention was needed to
get dma related allocation in the first physical 64M. I have verified
that the dma related allocation is in the first 64M and dmabounce is not
being used - just for those wondering if that was part of the problems.

Platform: ViewSonic V210 arm pxa255
Kernel 2.6.30.5 eabi
Wireless Drivers from compat-wireless-2009-09-30 and what I applied the below patch to.
Firmware used: FW rev 2.13.12.0 - Softmac protocol 5.9

Wireless card: GemTek WL-850FJB minipci card.

phy0: p54 detected a LM86 firmware
p54: rx_mtu reduced from 3240 to 2376
phy0: FW rev 2.13.12.0 - Softmac protocol 5.9
phy0: cryptographic accelerator WEP:YES, TKIP:YES, CCMP:YES
phy0: hwaddr 00:90:4b:c1:06:bc, MAC:isl3890 RF:Frisbee
phy0: Selected rate control algorithm 'minstrel'

device pci info (lspci -v):

00:06.0 Network controller: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] (rev 01)
Subsystem: Intersil Corporation Device 0000
Flags: bus master, medium devsel, latency 56, IRQ 217
Memory at 11000000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [dc] Power Management version 1
Kernel driver in use: p54pci
Kernel modules: prism54, p54pci

Reasons for patch was to solve the below problems.

1. p54p_check_rx_ring - skb_over_panic: Under a ping flood or just left
running for a bit would panic with a skb_over_panic. Investigation
showed for some odd reason the device/firmware instead of writing a
length in the data rx_ring (desc->len) had instead written the whole dma
address (host->host_addr) into location of the len/flag (host->len and
host->flags) spot and the same dma address that was in the ring. Added
the following condition in p54p_check_rx_ring to trap that condition and
trim the skb reset the len and flags only. By the way - I used haret to
see if it I could prove it happening under wince - located the dma
memory that was being used for rings - and also happening under windows
ce with the len/flag being set to the same as the host dma. Scanning
the ring at 1000 times per second (I think) In a flood or iperf. Would
see an occasional len/flag location get set to the same host address in
that ring - may only happen a few times every minute. Under normal
operation maybe a few times a day.

if(unlikely(len == (desc->host_addr & 0xffff)
&& (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) )

2. p54p_refill_rx_ring - eventual stall: Has the potential in very busy
(flood) to over run the last rx data processed ring index corrupting the
next rings - causing some havoc of getting some 13 indexes difference
between priv->rx_idx_data and ring_control host_idx on a 8 index ring.
This appears to eventually fill up the TX queue - returning a -ENOSPC in
p54_assign_address (txrx.c) because of ring corruption missing some TX
releases. Changed p54p_refill_rx_ring to take a index parm and use that
as the last processed ring index - instead of the using the ring_control
device_idx.

3. p54p_check_rx_ring - eventual stall: On ping flood - Control
P54_CONTROL_TYPE_TXDONE rx packets that are skb reused - seem to cause a
problem on the next time around with the same index. Even though the
length was not the same was still being seen as a
P54_CONTROL_TYPE_TXDONE packet again. Side affects varied - one being
the main end result same as the #2 listed above TX not being released
and returning a -ENOSPC in p54_assign_address (txrx.c) - stall.
Problem went away if did not reuse the skb but unmap it and
dev_kfree_skb if return was zero from p54_rx. Still unclear why this
would be - but had no problems with patch afterwards.

4. p54p_check_rx_ring - soft lockup in p54p_refill_rx_ring. This only
occurred when 5 minute iperf on a fast wireless network - Or 1 to 2 days
of unit left up. Discovered that the device had lost it's mind and set
the ring_control->device_index[ring_index] exactly 0xFF or 255 less than
it should be (ram issue??) don't know. Happens on three of my devices
the same way. If left to continue - the p54p_refill_rx_ring while loop
goes negative and soft lockup. Trap and return if device_idx - (*index)
greater than ring_index. Error is only tripped the one time - meaning
the next time p54p_check_rx_ring is called the device index is back to
what it should have been.

5. p54p_open - 1 out of 10 boots will produce device does not
respond! or Cannot boot firmware!. Minor - but frustrating all the
same.
Always rmmod p54pci and then modprobe p54pci works. It seems if get a
error on p54p_open trying again works. And if p54_read_eeprom fails -
trying again works.

The below was applied to compat-wireless-2009-09-30:

Thanks,

Quintin.

Signed-off-by: Quintin Pitts <[email protected]>

---

--- a/drivers/net/wireless/p54/p54pci.c 2009-09-29 23:13:58.000000000 -0500
+++ b/drivers/net/wireless/p54/p54pci.c 2009-10-09 08:15:58.000000000 -0500
@@ -131,7 +131,7 @@ static int p54p_upload_firmware(struct i

static void p54p_refill_rx_ring(struct ieee80211_hw *dev,
int ring_index, struct p54p_desc *ring, u32 ring_limit,
- struct sk_buff **rx_buf)
+ struct sk_buff **rx_buf, u32 index)
{
struct p54p_priv *priv = dev->priv;
struct p54p_ring_control *ring_control = priv->ring_control;
@@ -139,7 +139,11 @@ static void p54p_refill_rx_ring(struct i

idx = le32_to_cpu(ring_control->host_idx[ring_index]);
limit = idx;
- limit -= le32_to_cpu(ring_control->device_idx[ring_index]);
+/*
+ * Use last processed index instead of device_idx
+ * so we don't corrupt our ring
+ */
+ limit -= le32_to_cpu(index);
limit = ring_limit - limit;

i = idx % ring_limit;
@@ -181,9 +185,26 @@ static void p54p_check_rx_ring(struct ie
struct p54p_ring_control *ring_control = priv->ring_control;
struct p54p_desc *desc;
u32 idx, i;
+ int ret;

+ idx = le32_to_cpu(ring_control->device_idx[ring_index]);
i = (*index) % ring_limit;
- (*index) = idx = le32_to_cpu(ring_control->device_idx[ring_index]);
+ if(unlikely((idx - (*index)) > ring_limit ||
+ (le32_to_cpu(ring_control->host_idx[ring_index]) - (*index)) > ring_limit)) {
+ printk(KERN_DEBUG "%s: devidx jumped *index=%d devidx=%d hostidx=%d ring_limit=%d\n",
+ __func__,(*index),idx,ring_control->host_idx[ring_index],ring_limit);
+/*
+ * Do nothing things are really wrong - device index has jumped got corrupted
+ * - wait for it to stabilize
+ * So far device idx exactly 0xFF (255) bytes less than what it should be.
+ * only seen to happen on very fast wireless and packet floods and/or iperf test
+ * In testing this error only encountered once - so next time around the
+ * device index is correct.
+ * if to continue would soft lockup/hang in while loop in p54p_refill_rx_ring
+ */
+ return;
+ }
+ (*index) = idx;
idx %= ring_limit;
while (i != idx) {
u16 len;
@@ -197,25 +218,40 @@ static void p54p_check_rx_ring(struct ie
i %= ring_limit;
continue;
}
+ if(unlikely(len == (desc->host_addr & 0xffff)
+ && (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) ) {
+/* device has put device dma in desc len/flag location - will crash in skb_put
+ * desc->len and desc->flags contain the host_addr -
+ * trap before skb_put and discard
+ * ViewSonic V210 and wireless card GENTEK WL-850 , IT8152 PCI bridge
+ * happens occasionally - no clear reason or frequency.
+ *
+ */
+ printk(KERN_DEBUG "%s: rx_ring len/flags has address - skipping!\n",__func__);
+ skb_trim(skb,0);
+ desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
+ desc->flags=0;
+
+ } else {
+
skb_put(skb, len);

- if (p54_rx(dev, skb)) {
- pci_unmap_single(priv->pdev,
+ ret=p54_rx(dev,skb);
+ pci_unmap_single(priv->pdev,
le32_to_cpu(desc->host_addr),
priv->common.rx_mtu + 32,
PCI_DMA_FROMDEVICE);
- rx_buf[i] = NULL;
- desc->host_addr = 0;
- } else {
- skb_trim(skb, 0);
- desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
- }
+ if(ret==0)
+ dev_kfree_skb(skb);
+ rx_buf[i] = NULL;
+ desc->host_addr = 0;
+ } /* end of desc->len skb corrupt crash test */

i++;
i %= ring_limit;
}

- p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf);
+ p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf, (*index));
}

/* caller must hold priv->lock */
@@ -428,10 +464,10 @@ static int p54p_open(struct ieee80211_hw
priv->rx_idx_mgmt = priv->tx_idx_mgmt = 0;

p54p_refill_rx_ring(dev, 0, priv->ring_control->rx_data,
- ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data);
+ ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data, 0);

p54p_refill_rx_ring(dev, 2, priv->ring_control->rx_mgmt,
- ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt);
+ ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt, 0);

P54P_WRITE(ring_control_base, cpu_to_le32(priv->ring_control_dma));
P54P_READ(ring_control_base);
@@ -550,9 +586,26 @@ static int __devinit p54p_probe(struct p
}

err = p54p_open(dev);
- if (err)
- goto err_free_common;
+ if (err) {
+
+ printk(KERN_DEBUG "%s: p54p_open failed - trying again\n",__func__);
+ msleep(10);
+ err = p54p_open(dev);
+ if (err)
+ goto err_free_common;
+ }
err = p54_read_eeprom(dev);
+ if (err)
+ {
+ printk(KERN_DEBUG "%s: p54_read_eeprom failed - trying again\n",__func__);
+ p54p_stop(dev);
+ err = p54p_open(dev);
+ if (err)
+ goto err_free_common;
+ msleep(10);
+ err = p54_read_eeprom(dev);
+
+ }
p54p_stop(dev);
if (err)
goto err_free_common;



2009-10-12 00:10:08

by Quintin Pitts

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On Sun Oct 11 2009 10:31:42 GMT-0500 (CDT), Larry Finger wrote:
> On 10/11/2009 09:28 AM, Quintin Pitts wrote:
>> Hi,
>>
>> Sorry for my lack of experience in all aspects - first time
>> submitting!!!
>
> Everyone that goes through this "right of passage" gets somewhat
> discouraged by the response. My advice is to hang in.
>
> My first advice is for you to run every submitted patch through the
> check at scripts/checkpatch.pl. This one shows 95 errors and 7
> warnings in 136 lines. Most of the errors are due to "DOS line
> endings". We really hate carriage returns - a really useless occupier
> of space unless it is _NOT_ followed by \n!

Thanks for the advice!


> As I understand it, this patch is to fix the driver to work around
> firmware errors. If that is correct, please state that clearly. If
> only partially correct, then indicate which parts are to fix firmware
> errors, and which are to fix driver errors. Has your analysis included
> thinking about where the driver might delay to avoid firmware problems.

I think Christian has hit the nail on the head. Mostly flaky hardware
or implementation (it8152 pci bridge) when pushed.

>
> I will let Christian comment more on the technical merits of the patch
> as he understands the device much better than I do.
>
> Larry

Thanks,

Quintin.

2009-10-12 00:27:29

by Quintin Pitts

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On Sun Oct 11 2009 14:41:01 GMT-0500 (CDT), Christian Lamparter wrote:
> On Sunday 11 October 2009 17:31:42 Larry Finger wrote:
>>> In trying to get p54pci driver to be stable on my platform and hardware
>>> - here is a generic patch that seems to accomplish that. Since the
>>> ViewSonic V210 uses the IT8152 pci bridge - some attention was needed to
>>> get dma related allocation in the first physical 64M. I have verified
>>> that the dma related allocation is in the first 64M and dmabounce is not
>>> being used - just for those wondering if that was part of the problems.
>
> it8152 was an important bit:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/0645.html
>
> the commit sparked a discussion about it8152 pci reliability.
> It doesn't look good:
>
> (commit author):
> "I have no idea if it's possible to get a reliable PCI bus or
> not with this chip. Right now, we only use it for it's built-in OHCI
> USB host controller and UART. You're making me hope I never have to
> use it for interfacing a PCI card!"
>
> ( http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/1907.html )
>
> [...]
>
> "Well on the system on a board we were trying, using the development
> baseboard from the same supplier, by simply doing a ping flood through
> the onboard rtl8139 I managed to get corrupted ethernet packets fairly
> frequently. "
> ( http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/1917.html )
>
> the sad conclusion is that: no matter what fixes you throw
> at the driver (BTW: isl38xx isn't 100% pci v2.1 compliance either)
> your chances of getting the device (with the softmac fw)
> working properly with that board are next to... not,
> unless you can magically fix the pci-bridge issues.

I feared as much.

At least that patch has given me hope. It survives a iperf and ping
flood now with out going belly up. Device has been up for three days
now with network still working. But likely glitches will show up.

iperf tests at about 9.6 to 15 Mbits/sec on a 802.11g WPA network which is
about 2 times faster than what I had under WinCE.

web browsing and ssh and scp all work great at the moment.

Time will tell...

Thanks much!

Quintin.


2009-10-12 13:35:32

by Quintin Pitts

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On Mon Oct 12 2009 03:57:10 GMT-0500 (CDT), Christian Lamparter wrote:
> On Monday 12 October 2009 02:09:22 Quintin Pitts wrote:
>> On Sun Oct 11 2009 10:31:42 GMT-0500 (CDT), Larry Finger wrote:
>>> On 10/11/2009 09:28 AM, Quintin Pitts wrote:
>>> As I understand it, this patch is to fix the driver to work around
>>> firmware errors. If that is correct, please state that clearly. If
>>> only partially correct, then indicate which parts are to fix firmware
>>> errors, and which are to fix driver errors. Has your analysis included
>>> thinking about where the driver might delay to avoid firmware problems.
>> I think Christian has hit the nail on the head.
>> Mostly flaky hardware or implementation (it8152 pci bridge) when pushed.
>
> hmm, flaky hardware or simply incomplete linux support.
> you've said that it works with win ce,
> so I suppose the code is missing some important bits.
>
> e.g.:
>
> 00:06.0 Network controller: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] (rev 01)
> Subsystem: Intersil Corporation Device 0000
> Flags: bus master, medium devsel, latency 56, IRQ 217
>
> by the looks of it, something could wrong with the latency timer value.
> (what's lspci -vvnnxxx output for the card?)

00:06.0 Network controller [0280]: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] [1260:3886] (rev 01)
Subsystem: Intersil Corporation Device [1260:0000]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 56 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 217
Region 0: Memory at 11000000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel modules: prism54, p54pci
00: 60 12 86 38 06 01 90 02 01 00 80 02 08 38 00 00
10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 01 08 00 00 60 12 00 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 d9 01 0a 1c
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 fe
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

After the driver is up with the p54-latency patch:

00:06.0 Network controller [0280]: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] [1260:3886] (rev 01)
Subsystem: Intersil Corporation Device [1260:0000]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 80 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 217
Region 0: Memory at 11000000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: p54pci
Kernel modules: prism54, p54pci
00: 60 12 86 38 16 01 98 02 01 00 80 02 08 50 00 00
10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 01 08 00 00 60 12 00 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 d9 01 0a 1c
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 fe
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

>
> I've attached a minimal patch which c&p some latency-timer related logic
> from the original prism54 driver code to p54pci.
> Can you please give this a try?

Still same weird results with p54-latency patch. Triggering my condition
statements. under a 3 minute iperf test.

p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: index > ring_index diff *index=176894 devidx=176640 hostidx=
176901 ring_limit=8 Returning call!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!
p54p_check_rx_ring: rx_ring len/flags has address - skipping!

> I don't have a p54pci available for testing right now.
>
> Regards,
> Chr

Thanks,

Quintin.

2009-10-12 08:57:56

by Christian Lamparter

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On Monday 12 October 2009 02:09:22 Quintin Pitts wrote:
> On Sun Oct 11 2009 10:31:42 GMT-0500 (CDT), Larry Finger wrote:
> > On 10/11/2009 09:28 AM, Quintin Pitts wrote:
> > As I understand it, this patch is to fix the driver to work around
> > firmware errors. If that is correct, please state that clearly. If
> > only partially correct, then indicate which parts are to fix firmware
> > errors, and which are to fix driver errors. Has your analysis included
> > thinking about where the driver might delay to avoid firmware problems.
>
> I think Christian has hit the nail on the head.
> Mostly flaky hardware or implementation (it8152 pci bridge) when pushed.

hmm, flaky hardware or simply incomplete linux support.
you've said that it works with win ce,
so I suppose the code is missing some important bits.

e.g.:

00:06.0 Network controller: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] (rev 01)
Subsystem: Intersil Corporation Device 0000
Flags: bus master, medium devsel, latency 56, IRQ 217

by the looks of it, something could wrong with the latency timer value.
(what's lspci -vvnnxxx output for the card?)

I've attached a minimal patch which c&p some latency-timer related logic
from the original prism54 driver code to p54pci.
Can you please give this a try?
I don't have a p54pci available for testing right now.

Regards,
Chr


Attachments:
p54-latency.diff (865.00 B)

2009-10-11 15:32:22

by Larry Finger

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On 10/11/2009 09:28 AM, Quintin Pitts wrote:
> Hi,
>
> Sorry for my lack of experience in all aspects - first time
> submitting!!!

Everyone that goes through this "right of passage" gets somewhat
discouraged by the response. My advice is to hang in.

My first advice is for you to run every submitted patch through the
check at scripts/checkpatch.pl. This one shows 95 errors and 7
warnings in 136 lines. Most of the errors are due to "DOS line
endings". We really hate carriage returns - a really useless occupier
of space unless it is _NOT_ followed by \n!

As I understand it, this patch is to fix the driver to work around
firmware errors. If that is correct, please state that clearly. If
only partially correct, then indicate which parts are to fix firmware
errors, and which are to fix driver errors. Has your analysis included
thinking about where the driver might delay to avoid firmware problems.

> In trying to get p54pci driver to be stable on my platform and hardware
> - here is a generic patch that seems to accomplish that. Since the
> ViewSonic V210 uses the IT8152 pci bridge - some attention was needed to
> get dma related allocation in the first physical 64M. I have verified
> that the dma related allocation is in the first 64M and dmabounce is not
> being used - just for those wondering if that was part of the problems.
>
> Platform: ViewSonic V210 arm pxa255
> Kernel 2.6.30.5 eabi
> Wireless Drivers from compat-wireless-2009-09-30 and what I applied the below patch to.
> Firmware used: FW rev 2.13.12.0 - Softmac protocol 5.9
>
> Wireless card: GemTek WL-850FJB minipci card.
>
> phy0: p54 detected a LM86 firmware
> p54: rx_mtu reduced from 3240 to 2376
> phy0: FW rev 2.13.12.0 - Softmac protocol 5.9
> phy0: cryptographic accelerator WEP:YES, TKIP:YES, CCMP:YES
> phy0: hwaddr 00:90:4b:c1:06:bc, MAC:isl3890 RF:Frisbee
> phy0: Selected rate control algorithm 'minstrel'
>
> device pci info (lspci -v):
>
> 00:06.0 Network controller: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] (rev 01)
> Subsystem: Intersil Corporation Device 0000
> Flags: bus master, medium devsel, latency 56, IRQ 217
> Memory at 11000000 (32-bit, non-prefetchable) [size=8K]
> Capabilities: [dc] Power Management version 1
> Kernel driver in use: p54pci
> Kernel modules: prism54, p54pci

Mush of the above is useless detail. Stating the device and the
platform should be sufficient.

> Reasons for patch was to solve the below problems.
>
> 1. p54p_check_rx_ring - skb_over_panic: Under a ping flood or just left
> running for a bit would panic with a skb_over_panic. Investigation
> showed for some odd reason the device/firmware instead of writing a
> length in the data rx_ring (desc->len) had instead written the whole dma
> address (host->host_addr) into location of the len/flag (host->len and
> host->flags) spot and the same dma address that was in the ring. Added
> the following condition in p54p_check_rx_ring to trap that condition and
> trim the skb reset the len and flags only. By the way - I used haret to
> see if it I could prove it happening under wince - located the dma
> memory that was being used for rings - and also happening under windows
> ce with the len/flag being set to the same as the host dma. Scanning
> the ring at 1000 times per second (I think) In a flood or iperf. Would
> see an occasional len/flag location get set to the same host address in
> that ring - may only happen a few times every minute. Under normal
> operation maybe a few times a day.
>
> if(unlikely(len == (desc->host_addr & 0xffff)
> && (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) )
>
> 2. p54p_refill_rx_ring - eventual stall: Has the potential in very busy
> (flood) to over run the last rx data processed ring index corrupting the
> next rings - causing some havoc of getting some 13 indexes difference
> between priv->rx_idx_data and ring_control host_idx on a 8 index ring.
> This appears to eventually fill up the TX queue - returning a -ENOSPC in
> p54_assign_address (txrx.c) because of ring corruption missing some TX
> releases. Changed p54p_refill_rx_ring to take a index parm and use that
> as the last processed ring index - instead of the using the ring_control
> device_idx.
>
> 3. p54p_check_rx_ring - eventual stall: On ping flood - Control
> P54_CONTROL_TYPE_TXDONE rx packets that are skb reused - seem to cause a
> problem on the next time around with the same index. Even though the
> length was not the same was still being seen as a
> P54_CONTROL_TYPE_TXDONE packet again. Side affects varied - one being
> the main end result same as the #2 listed above TX not being released
> and returning a -ENOSPC in p54_assign_address (txrx.c) - stall.
> Problem went away if did not reuse the skb but unmap it and
> dev_kfree_skb if return was zero from p54_rx. Still unclear why this
> would be - but had no problems with patch afterwards.
>
> 4. p54p_check_rx_ring - soft lockup in p54p_refill_rx_ring. This only
> occurred when 5 minute iperf on a fast wireless network - Or 1 to 2 days
> of unit left up. Discovered that the device had lost it's mind and set
> the ring_control->device_index[ring_index] exactly 0xFF or 255 less than
> it should be (ram issue??) don't know. Happens on three of my devices
> the same way. If left to continue - the p54p_refill_rx_ring while loop
> goes negative and soft lockup. Trap and return if device_idx - (*index)
> greater than ring_index. Error is only tripped the one time - meaning
> the next time p54p_check_rx_ring is called the device index is back to
> what it should have been.
>
> 5. p54p_open - 1 out of 10 boots will produce device does not
> respond! or Cannot boot firmware!. Minor - but frustrating all the
> same.
> Always rmmod p54pci and then modprobe p54pci works. It seems if get a
> error on p54p_open trying again works. And if p54_read_eeprom fails -
> trying again works.
>
> The below was applied to compat-wireless-2009-09-30:
>
> Thanks,
>
> Quintin.
>
> Signed-off-by: Quintin Pitts <[email protected]>
>
> ---
>
> --- a/drivers/net/wireless/p54/p54pci.c 2009-09-29 23:13:58.000000000 -0500
> +++ b/drivers/net/wireless/p54/p54pci.c 2009-10-09 08:15:58.000000000 -0500
> @@ -131,7 +131,7 @@ static int p54p_upload_firmware(struct i
>
> static void p54p_refill_rx_ring(struct ieee80211_hw *dev,
> int ring_index, struct p54p_desc *ring, u32 ring_limit,
> - struct sk_buff **rx_buf)
> + struct sk_buff **rx_buf, u32 index)
> {
> struct p54p_priv *priv = dev->priv;
> struct p54p_ring_control *ring_control = priv->ring_control;
> @@ -139,7 +139,11 @@ static void p54p_refill_rx_ring(struct i
>
> idx = le32_to_cpu(ring_control->host_idx[ring_index]);
> limit = idx;
> - limit -= le32_to_cpu(ring_control->device_idx[ring_index]);
> +/*
> + * Use last processed index instead of device_idx
> + * so we don't corrupt our ring
> + */
> + limit -= le32_to_cpu(index);
> limit = ring_limit - limit;
>
> i = idx % ring_limit;
> @@ -181,9 +185,26 @@ static void p54p_check_rx_ring(struct ie
> struct p54p_ring_control *ring_control = priv->ring_control;
> struct p54p_desc *desc;
> u32 idx, i;
> + int ret;
>
> + idx = le32_to_cpu(ring_control->device_idx[ring_index]);
> i = (*index) % ring_limit;
> - (*index) = idx = le32_to_cpu(ring_control->device_idx[ring_index]);
> + if(unlikely((idx - (*index)) > ring_limit ||
> + (le32_to_cpu(ring_control->host_idx[ring_index]) - (*index)) > ring_limit)) {

The indentation in this section is strange.

> + printk(KERN_DEBUG "%s: devidx jumped *index=%d devidx=%d hostidx=%d ring_limit=%d\n",
> + __func__,(*index),idx,ring_control->host_idx[ring_index],ring_limit);
> +/*
> + * Do nothing things are really wrong - device index has jumped got corrupted
> + * - wait for it to stabilize
> + * So far device idx exactly 0xFF (255) bytes less than what it should be.
> + * only seen to happen on very fast wireless and packet floods and/or iperf test
> + * In testing this error only encountered once - so next time around the
> + * device index is correct.
> + * if to continue would soft lockup/hang in while loop in p54p_refill_rx_ring
> + */
> + return;
> + }

This section looks like one where a driver delay might resolve the
problem. I have no objections to your fix. I'm just curious.

> + (*index) = idx;
> idx %= ring_limit;
> while (i != idx) {
> u16 len;
> @@ -197,25 +218,40 @@ static void p54p_check_rx_ring(struct ie
> i %= ring_limit;
> continue;
> }
> + if(unlikely(len == (desc->host_addr & 0xffff)
> + && (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) ) {

You have a whitespace problem (space after if) and again an
indentation problem. Usually, the form is as follows:

if (unlikely(len == (desc->host_addr & 0xffff) &&
(desc->flags....

> +/* device has put device dma in desc len/flag location - will crash in skb_put
> + * desc->len and desc->flags contain the host_addr -
> + * trap before skb_put and discard
> + * ViewSonic V210 and wireless card GENTEK WL-850 , IT8152 PCI bridge
> + * happens occasionally - no clear reason or frequency.
> + *
> + */
> + printk(KERN_DEBUG "%s: rx_ring len/flags has address - skipping!\n",__func__);
> + skb_trim(skb,0);
> + desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
> + desc->flags=0;

There is an indentation problem here.

> +
> + } else {
> +
> skb_put(skb, len);
>
> - if (p54_rx(dev, skb)) {
> - pci_unmap_single(priv->pdev,
> + ret=p54_rx(dev,skb);
> + pci_unmap_single(priv->pdev,

Here too.

> le32_to_cpu(desc->host_addr),
> priv->common.rx_mtu + 32,
> PCI_DMA_FROMDEVICE);
> - rx_buf[i] = NULL;
> - desc->host_addr = 0;
> - } else {
> - skb_trim(skb, 0);
> - desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
> - }
> + if(ret==0)
> + dev_kfree_skb(skb);
> + rx_buf[i] = NULL;
> + desc->host_addr = 0;
> + } /* end of desc->len skb corrupt crash test */
>
> i++;
> i %= ring_limit;
> }
>
> - p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf);
> + p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf, (*index));
> }
>
> /* caller must hold priv->lock */
> @@ -428,10 +464,10 @@ static int p54p_open(struct ieee80211_hw
> priv->rx_idx_mgmt = priv->tx_idx_mgmt = 0;
>
> p54p_refill_rx_ring(dev, 0, priv->ring_control->rx_data,
> - ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data);
> + ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data, 0);
>
> p54p_refill_rx_ring(dev, 2, priv->ring_control->rx_mgmt,
> - ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt);
> + ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt, 0);
>
> P54P_WRITE(ring_control_base, cpu_to_le32(priv->ring_control_dma));
> P54P_READ(ring_control_base);
> @@ -550,9 +586,26 @@ static int __devinit p54p_probe(struct p
> }
>
> err = p54p_open(dev);
> - if (err)
> - goto err_free_common;
> + if (err) {
> +
> + printk(KERN_DEBUG "%s: p54p_open failed - trying again\n",__func__);
> + msleep(10);
> + err = p54p_open(dev);
> + if (err)
> + goto err_free_common;
> + }
> err = p54_read_eeprom(dev);
> + if (err)
> + {
> + printk(KERN_DEBUG "%s: p54_read_eeprom failed - trying again\n",__func__);
> + p54p_stop(dev);
> + err = p54p_open(dev);
> + if (err)
> + goto err_free_common;
> + msleep(10);
> + err = p54_read_eeprom(dev);
> +
> + }
> p54p_stop(dev);
> if (err)
> goto err_free_common;
>

I will let Christian comment more on the technical merits of the patch
as he understands the device much better than I do.

Larry

2009-10-11 19:41:46

by Christian Lamparter

[permalink] [raw]
Subject: Re: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood

On Sunday 11 October 2009 17:31:42 Larry Finger wrote:
> > In trying to get p54pci driver to be stable on my platform and hardware
> > - here is a generic patch that seems to accomplish that. Since the
> > ViewSonic V210 uses the IT8152 pci bridge - some attention was needed to
> > get dma related allocation in the first physical 64M. I have verified
> > that the dma related allocation is in the first 64M and dmabounce is not
> > being used - just for those wondering if that was part of the problems.

it8152 was an important bit:

http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/0645.html

the commit sparked a discussion about it8152 pci reliability.
It doesn't look good:

(commit author):
"I have no idea if it's possible to get a reliable PCI bus or
not with this chip. Right now, we only use it for it's built-in OHCI
USB host controller and UART. You're making me hope I never have to
use it for interfacing a PCI card!"

( http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/1907.html )

[...]

"Well on the system on a board we were trying, using the development
baseboard from the same supplier, by simply doing a ping flood through
the onboard rtl8139 I managed to get corrupted ethernet packets fairly
frequently. "
( http://lkml.indiana.edu/hypermail/linux/kernel/0702.1/1917.html )

the sad conclusion is that: no matter what fixes you throw
at the driver (BTW: isl38xx isn't 100% pci v2.1 compliance either)
your chances of getting the device (with the softmac fw)
working properly with that board are next to... not,
unless you can magically fix the pci-bridge issues.

Regards,
Chr