2012-08-22 04:07:58

by Larry Finger

[permalink] [raw]
Subject: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

Hi,

The commit entitled "tcp: reduce out_of_order memory use" turns out to cause
problems with a number of USB drivers.

The first one called to my attention was for staging/r8712u. For this driver,
there are problems with SSL communications as reported in
https://bugzilla.redhat.com/show_bug.cgi?id=847525.

Other drivers including rtl8187, rtl8192cu and rt2800usb cannot connect to a
WPA1- or WEP-encrypted AP, but work fine with WPA2. The rtl8192cu problem is
reported in https://bugzilla.kernel.org/show_bug.cgi?id=46171.

I find it hard to understand why this patch should cause these effects; however,
I have verified that a kernel generated from "git checkout e86b2919" works fine.
This commit is immediately before the patch in question.

Note, this patch was applied during the merge between 3.3 and 3.4. The
regression has been undiscovered for some time.

Any help on this problem will be appreciated.

Larry


2012-08-22 04:26:41

by David Miller

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

From: Larry Finger <[email protected]>
Date: Tue, 21 Aug 2012 23:07:46 -0500

> I find it hard to understand why this patch should cause these
> effects; however, I have verified that a kernel generated from "git
> checkout e86b2919" works fine. This commit is immediately before the
> patch in question.

I can guarentee it's improper skb->truesize setting in these drivers,
or use of too huge buffer sizes compared to the amount of data
contained within the packet.

2012-08-22 05:15:11

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Tue, 2012-08-21 at 23:07 -0500, Larry Finger wrote:
> Hi,
>
> The commit entitled "tcp: reduce out_of_order memory use" turns out to cause
> problems with a number of USB drivers.
>
> The first one called to my attention was for staging/r8712u. For this driver,
> there are problems with SSL communications as reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=847525.
>
> Other drivers including rtl8187, rtl8192cu and rt2800usb cannot connect to a
> WPA1- or WEP-encrypted AP, but work fine with WPA2. The rtl8192cu problem is
> reported in https://bugzilla.kernel.org/show_bug.cgi?id=46171.
>
> I find it hard to understand why this patch should cause these effects; however,
> I have verified that a kernel generated from "git checkout e86b2919" works fine.
> This commit is immediately before the patch in question.
>
> Note, this patch was applied during the merge between 3.3 and 3.4. The
> regression has been undiscovered for some time.
>
> Any help on this problem will be appreciated.
>
> Larry

This particular commit is the start of a patches batch that ended in the
generic TCP coalescing mechanism.

It is known to have problem on drivers doing skb_clone() in their rx
path.

Current kernels should be ok, because coalescing doesnt happen if the
destination skb is cloned (skb_cloned(to) in skb_try_coalesce())

For 3.4 kernel, I guess I need to backport this skb_cloned(to) check fo
stable 3.4 kernel

But these skb_clone() in various USB drivers should be killed for good,
they really can kill the box because of skb->truesize lies.

Please test following patch (for 3.4 kernels)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 257b617..c45ac2d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4496,7 +4496,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
* to avoid future tcp_collapse_ofo_queue(),
* probably the most expensive function in tcp stack.
*/
- if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
+ if (skb->len <= skb_tailroom(skb1) &&
+ !skb_cloned(skb1) &&
+ !tcp_hdr(skb)->fin) {
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPRCVCOALESCE);
BUG_ON(skb_copy_bits(skb, 0,

2012-08-22 16:00:22

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 12:15 AM, Eric Dumazet wrote:
>
> This particular commit is the start of a patches batch that ended in the
> generic TCP coalescing mechanism.
>
> It is known to have problem on drivers doing skb_clone() in their rx
> path.
>
> Current kernels should be ok, because coalescing doesnt happen if the
> destination skb is cloned (skb_cloned(to) in skb_try_coalesce())
>
> For 3.4 kernel, I guess I need to backport this skb_cloned(to) check fo
> stable 3.4 kernel
>
> But these skb_clone() in various USB drivers should be killed for good,
> they really can kill the box because of skb->truesize lies.

Applying your patch to 3.4 does not fix the problem. In addition, all of the
problems I noted happen with 3.6-rc2 - current kernels are not OK.

Of the drivers I reported, only r8712u in staging/rtl8712/rtl8712_recv.c has a
direct call to skb_clone(). As its symptoms are different, the cloning is likely
the source of the problems there. In addition, the problem is intermittent. The
driver normally allocates a new buffer and copies to it, and the clone only
occurs when that skb allocation fails. I will try to fix that. My skb skills are
minimal and I will have to do some studying.

Thanks,

Larry

2012-08-22 21:33:28

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 12:15 AM, Eric Dumazet wrote:
>
> This particular commit is the start of a patches batch that ended in the
> generic TCP coalescing mechanism.
>
> It is known to have problem on drivers doing skb_clone() in their rx
> path.
>
> Current kernels should be ok, because coalescing doesnt happen if the
> destination skb is cloned (skb_cloned(to) in skb_try_coalesce())

The skb_clone() call is not the source of the problem for r8712u, as it is only
used when a memory allocation fails, which is not happening. The suggestion did
lead to another patch that I had been preparing. The initial allocation of RX
buffers used a size of 30720 bytes, while 9100 is sufficient to allow
aggregation. Upon reducing the buffer size, the driver now works for me. I am
now awaiting tests by the OP on the Red Hat bugzilla before sending the patch
upstream.

So far, no ideas for the problem in connecting to WPA1 networks.

Larry

2012-08-23 04:03:22

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Wed, 2012-08-22 at 16:33 -0500, Larry Finger wrote:
> On 08/22/2012 12:15 AM, Eric Dumazet wrote:
> >
> > This particular commit is the start of a patches batch that ended in the
> > generic TCP coalescing mechanism.
> >
> > It is known to have problem on drivers doing skb_clone() in their rx
> > path.
> >
> > Current kernels should be ok, because coalescing doesnt happen if the
> > destination skb is cloned (skb_cloned(to) in skb_try_coalesce())
>
> The skb_clone() call is not the source of the problem for r8712u, as it is only
> used when a memory allocation fails, which is not happening. The suggestion did
> lead to another patch that I had been preparing. The initial allocation of RX
> buffers used a size of 30720 bytes, while 9100 is sufficient to allow
> aggregation. Upon reducing the buffer size, the driver now works for me. I am
> now awaiting tests by the OP on the Red Hat bugzilla before sending the patch
> upstream.
>
> So far, no ideas for the problem in connecting to WPA1 networks.
>

Changing the allocation size removes the problem ? thats really strange.

If you try different sizes in the 9100-30720 range, can you pinpoint the
failure threshold ?


2012-08-23 20:57:17

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 11:03 PM, Eric Dumazet wrote:
>
> Changing the allocation size removes the problem ? thats really strange.
>
> If you try different sizes in the 9100-30720 range, can you pinpoint the
> failure threshold ?

The allocation size change did not fix the problem. It turned out that 10 tries
from a secure web page were not enough to trigger this intermittent problem that
particular test.

Based on DaveM's comment that skb->truesize could be wrong, I tried setting
truesize after every netdev_alloc_skb() call. Of course, that had no effect. I
then found https://lkml.org/lkml/2010/11/19/505I, which clearly states why this
need not be done.

What skb modifications require that truesize be adjusted? The driver never
resets skb->len or skb->data_len for any buffers, other than setting skb->len to
zero.

Thanks,

Larry

2012-08-23 21:26:49

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Thu, 2012-08-23 at 15:57 -0500, Larry Finger wrote:
> On 08/22/2012 11:03 PM, Eric Dumazet wrote:
> >
> > Changing the allocation size removes the problem ? thats really strange.
> >
> > If you try different sizes in the 9100-30720 range, can you pinpoint the
> > failure threshold ?
>
> The allocation size change did not fix the problem. It turned out that 10 tries
> from a secure web page were not enough to trigger this intermittent problem that
> particular test.
>
> Based on DaveM's comment that skb->truesize could be wrong, I tried setting
> truesize after every netdev_alloc_skb() call. Of course, that had no effect. I
> then found https://lkml.org/lkml/2010/11/19/505I, which clearly states why this
> need not be done.
>
> What skb modifications require that truesize be adjusted? The driver never
> resets skb->len or skb->data_len for any buffers, other than setting skb->len to
> zero.

skb->truesize is adjusted when a frag is added to one skb, or when
skb->head is re-allocated.

Are you sure you dont have another problem, because as I said commit
c8628155ece3 had a bug, so a bisect is not very useful.

How many reloads are needed to trigger the bug, do you have a script to
reproduce it ?

Could it be a PMTU problem ? (check
http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=9b04f350057863d1fad1ba071e09362a1da3503e )

2012-08-24 14:10:06

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/23/2012 04:26 PM, Eric Dumazet wrote:
>
> skb->truesize is adjusted when a frag is added to one skb, or when
> skb->head is re-allocated.
>
> Are you sure you dont have another problem, because as I said commit
> c8628155ece3 had a bug, so a bisect is not very useful.
>
> How many reloads are needed to trigger the bug, do you have a script to
> reproduce it ?

I was doing the tests with a browser; however, it seems that a wget of that same
page is much more effective at triggering the problem. Now I have a script that
tries 60 passes of

wget https://bugzilla.redhat.com/show_bug.cgi?id=847525

I usually get failures on the first or second try. A kernel built from "git
checkout c862815" gets the following wget output:

============================================================================
--2012-08-23 21:50:32-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
connected.
OpenSSL: error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong tag
OpenSSL: error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested asn1 error
OpenSSL: error:0D08303A:asn1 encoding routines:ASN1_TEMPLATE_NOEXP_D2I:nested
asn1 error
OpenSSL: error:0D08303A:asn1 encoding routines:ASN1_TEMPLATE_NOEXP_D2I:nested
asn1 error
OpenSSL: error:1409000D:SSL routines:SSL3_GET_SERVER_CERTIFICATE:ASN1 lib
Unable to establish SSL connection.
============================================================================

With kernel 3.6-rc2, the error changes to the following:

============================================================================
--2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `show_bug.cgi?id=847525.3'

[ <=> ] 0 --.-K/s in 0s

2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
============================================================================

> Could it be a PMTU problem ? (check
> http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=9b04f350057863d1fad1ba071e09362a1da3503e )

I added the PMTU patch to kernel 3.6-rc2 - nothing changed.

Anything else to try?

Thanks,

Larry

2012-08-24 14:56:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

Le vendredi 24 août 2012 à 09:09 -0500, Larry Finger a écrit :

> With kernel 3.6-rc2, the error changes to the following:
>
> ============================================================================
> --2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
> Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
> connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `show_bug.cgi?id=847525.3'
>
> [ <=> ] 0 --.-K/s in 0s
>
> 2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
> routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
> ============================================================================
>

> Anything else to try?

When this happens, nothing in kernel log ?

Could you do

strace -o STRACE -s 8000 wget https://bugzilla.redhat.com/show_bug.cgi?id=847525

and send the STRACE file when you get a failure ?

Thanks


2012-08-24 15:19:53

by David Miller

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

From: Larry Finger <[email protected]>
Date: Fri, 24 Aug 2012 09:09:57 -0500

> I usually get failures on the first or second try. A kernel built from
> "git checkout c862815" gets the following wget output:
>
> ============================================================================
> --2012-08-23 21:50:32--
> --https://bugzilla.redhat.com/show_bug.cgi?id=847525
> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
> Connecting to bugzilla.redhat.com
> (bugzilla.redhat.com)|209.132.183.69|:443... connected.
> OpenSSL: error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong
> tag
> OpenSSL: error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested
> asn1 error
> OpenSSL: error:0D08303A:asn1 encoding
> routines:ASN1_TEMPLATE_NOEXP_D2I:nested asn1 error
> OpenSSL: error:0D08303A:asn1 encoding
> routines:ASN1_TEMPLATE_NOEXP_D2I:nested asn1 error
> OpenSSL: error:1409000D:SSL routines:SSL3_GET_SERVER_CERTIFICATE:ASN1
> lib
> Unable to establish SSL connection.
> ============================================================================

This looks like full-on data corruption to me.

2012-08-24 15:51:14

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 09:55 AM, Eric Dumazet wrote:
> Le vendredi 24 août 2012 à 09:09 -0500, Larry Finger a écrit :
>
>> With kernel 3.6-rc2, the error changes to the following:
>>
>> ============================================================================
>> --2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
>> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
>> Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
>> connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: unspecified [text/html]
>> Saving to: `show_bug.cgi?id=847525.3'
>>
>> [ <=> ] 0 --.-K/s in 0s
>>
>> 2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
>> routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
>> ============================================================================
>>
>
>> Anything else to try?
>
> When this happens, nothing in kernel log ?
>
> Could you do
>
> strace -o STRACE -s 8000 wget https://bugzilla.redhat.com/show_bug.cgi?id=847525
>
> and send the STRACE file when you get a failure ?

There is nothing in kernel log when it happens. The file STRACE is attached.

Larry



Attachments:
STRACE (380.75 kB)

2012-08-24 15:58:54

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 10:19 AM, David Miller wrote:
>
> This looks like full-on data corruption to me.

I agree. The question is why does it happen with r8712u, and only after the
commit in the subject. Drivers for other devices that I have are OK. Thus far, I
have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
no reports posted for this bug with any other device.

Larry



2012-08-24 16:01:42

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 10:49 -0500, Larry Finger wrote:

> There is nothing in kernel log when it happens. The file STRACE is attached.
>

So there is indeed a corruption. What was the driver you used in this
case ?

2012-08-24 16:18:35

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> On 08/24/2012 10:19 AM, David Miller wrote:
> >
> > This looks like full-on data corruption to me.
>
> I agree. The question is why does it happen with r8712u, and only after the
> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> no reports posted for this bug with any other device.

bugs can sit unnoticed, and one change somewhere can uncover them.

Really this driver must have a bug, if not half a dozen of bugs.

For example this sequence of code is a clear bug :

sub_skb = dev_alloc_skb(nSubframe_Length + 12);
skb_reserve(sub_skb, 12);


Also the free_recv_skb_queue looks really suspect to me

What the hell is doing recv_tasklet() I really wonder.

This code, combined with the skb_clone() in recvbuf2recvframe()
can clearly reuse an skb passed to upper stacks.


queueing one skb in free_recv_skb_queue should be done
only if no clone of this skb exist somewhere.

Please someone fix this buggy driver.

2012-08-24 16:24:02

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> > On 08/24/2012 10:19 AM, David Miller wrote:
> > >
> > > This looks like full-on data corruption to me.
> >
> > I agree. The question is why does it happen with r8712u, and only after the
> > commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> > have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> > no reports posted for this bug with any other device.
>
> bugs can sit unnoticed, and one change somewhere can uncover them.
>
> Really this driver must have a bug, if not half a dozen of bugs.
>
> For example this sequence of code is a clear bug :
>
> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> skb_reserve(sub_skb, 12);
>
>
> Also the free_recv_skb_queue looks really suspect to me
>
> What the hell is doing recv_tasklet() I really wonder.
>
> This code, combined with the skb_clone() in recvbuf2recvframe()
> can clearly reuse an skb passed to upper stacks.
>
>
> queueing one skb in free_recv_skb_queue should be done
> only if no clone of this skb exist somewhere.
>
> Please someone fix this buggy driver.
>

Try the following patch for a start

diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
index 8e82ce2..88e3ca6 100644
--- a/drivers/staging/rtl8712/rtl8712_recv.c
+++ b/drivers/staging/rtl8712/rtl8712_recv.c
@@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
recvbuf2recvframe(padapter, pskb);
skb_reset_tail_pointer(pskb);
pskb->len = 0;
- skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
+ if (!skb_cloned(pskb))
+ skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
+ else
+ consume_skb(pskb);
}
}

2012-08-24 16:29:31

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 11:01 AM, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 10:49 -0500, Larry Finger wrote:
>
>> There is nothing in kernel log when it happens. The file STRACE is attached.
>>
>
> So there is indeed a corruption. What was the driver you used in this
> case ?

The only driver that shows this problem is r8712u. One thought that I just had
is that this driver does not use mac80211, which is why it is in staging.
Instead it uses a Realtek soft MAC layer derived from the obsolete softmac
driver that was in the kernel in the 2.6.20 time frame. That means that none of
the skbs have ever been cloned by any part of the wireless code.

Larry

2012-08-24 16:58:26

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 11:23 AM, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
>> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
>>> On 08/24/2012 10:19 AM, David Miller wrote:
>>>>
>>>> This looks like full-on data corruption to me.
>>>
>>> I agree. The question is why does it happen with r8712u, and only after the
>>> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
>>> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
>>> no reports posted for this bug with any other device.
>>
>> bugs can sit unnoticed, and one change somewhere can uncover them.
>>
>> Really this driver must have a bug, if not half a dozen of bugs.
>>
>> For example this sequence of code is a clear bug :
>>
>> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
>> skb_reserve(sub_skb, 12);
>>
>>
>> Also the free_recv_skb_queue looks really suspect to me
>>
>> What the hell is doing recv_tasklet() I really wonder.
>>
>> This code, combined with the skb_clone() in recvbuf2recvframe()
>> can clearly reuse an skb passed to upper stacks.
>>
>>
>> queueing one skb in free_recv_skb_queue should be done
>> only if no clone of this skb exist somewhere.
>>
>> Please someone fix this buggy driver.
>>
>
> Try the following patch for a start
>
> diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
> index 8e82ce2..88e3ca6 100644
> --- a/drivers/staging/rtl8712/rtl8712_recv.c
> +++ b/drivers/staging/rtl8712/rtl8712_recv.c
> @@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
> recvbuf2recvframe(padapter, pskb);
> skb_reset_tail_pointer(pskb);
> pskb->len = 0;
> - skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> + if (!skb_cloned(pskb))
> + skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> + else
> + consume_skb(pskb);
> }
> }

This one did not help. There is no doubt it is needed for the case where memory
is tight, an allocation fails, and the driver clones the skb. In the present
case, debug statements have shown that the skb_clone() call was not made.

In the long term, this driver will be replaced with one that uses mac80211, but
in the short term, I am trying to fix it.

As I said earlier, my skb skills are minimal. Could you explain what is wrong
with the following sequence?

sub_skb = dev_alloc_skb(nSubframe_Length + 12);
skb_reserve(sub_skb, 12);

Thanks,

Larry

2012-08-24 17:47:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 11:58 -0500, Larry Finger wrote:
> On 08/24/2012 11:23 AM, Eric Dumazet wrote:
> > On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
> >> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> >>> On 08/24/2012 10:19 AM, David Miller wrote:
> >>>>
> >>>> This looks like full-on data corruption to me.
> >>>
> >>> I agree. The question is why does it happen with r8712u, and only after the
> >>> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> >>> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> >>> no reports posted for this bug with any other device.
> >>
> >> bugs can sit unnoticed, and one change somewhere can uncover them.
> >>
> >> Really this driver must have a bug, if not half a dozen of bugs.
> >>
> >> For example this sequence of code is a clear bug :
> >>
> >> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> >> skb_reserve(sub_skb, 12);
> >>
> >>
> >> Also the free_recv_skb_queue looks really suspect to me
> >>
> >> What the hell is doing recv_tasklet() I really wonder.
> >>
> >> This code, combined with the skb_clone() in recvbuf2recvframe()
> >> can clearly reuse an skb passed to upper stacks.
> >>
> >>
> >> queueing one skb in free_recv_skb_queue should be done
> >> only if no clone of this skb exist somewhere.
> >>
> >> Please someone fix this buggy driver.
> >>
> >
> > Try the following patch for a start
> >
> > diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
> > index 8e82ce2..88e3ca6 100644
> > --- a/drivers/staging/rtl8712/rtl8712_recv.c
> > +++ b/drivers/staging/rtl8712/rtl8712_recv.c
> > @@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
> > recvbuf2recvframe(padapter, pskb);
> > skb_reset_tail_pointer(pskb);
> > pskb->len = 0;
> > - skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> > + if (!skb_cloned(pskb))
> > + skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> > + else
> > + consume_skb(pskb);
> > }
> > }
>
> This one did not help. There is no doubt it is needed for the case where memory
> is tight, an allocation fails, and the driver clones the skb. In the present
> case, debug statements have shown that the skb_clone() call was not made.
>
> In the long term, this driver will be replaced with one that uses mac80211, but
> in the short term, I am trying to fix it.
>
> As I said earlier, my skb skills are minimal. Could you explain what is wrong
> with the following sequence?
>
> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> skb_reserve(sub_skb, 12);

dev_alloc_skb() can return NULL

-> crash

skb_clone() can also return NULL

-> crash


2012-08-27 17:55:47

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 12:47 PM, Eric Dumazet wrote:
>
> dev_alloc_skb() can return NULL
>
> -> crash
>
> skb_clone() can also return NULL
>
> -> crash

I have prepared a patch to fix all the unchecked allocations.

Over the weekend I made some progress. To test the latest vendor driver, I
installed a 32-bit system. Their driver is not compatible with a 64-bit system.
I found that not only did the vendor driver work with secure sites, but so did
the in-kernel version. I now have tcpdump output for the 32-bit case that works,
and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
incompatibility when I did the conversion.

Thanks for all your help in trying to resolve this issue.

Larry

2012-08-27 18:21:48

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:

> I have prepared a patch to fix all the unchecked allocations.
>
> Over the weekend I made some progress. To test the latest vendor driver, I
> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
> I found that not only did the vendor driver work with secure sites, but so did
> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
> incompatibility when I did the conversion.
>
> Thanks for all your help in trying to resolve this issue.

Interesting, so these 32/64 bits problems did not happen in prior
kernels ? (< 3.4 )

2012-08-27 20:39:34

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/27/2012 01:21 PM, Eric Dumazet wrote:
> On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:
>
>> I have prepared a patch to fix all the unchecked allocations.
>>
>> Over the weekend I made some progress. To test the latest vendor driver, I
>> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
>> I found that not only did the vendor driver work with secure sites, but so did
>> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
>> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
>> incompatibility when I did the conversion.
>>
>> Thanks for all your help in trying to resolve this issue.
>
> Interesting, so these 32/64 bits problems did not happen in prior
> kernels ? (< 3.4 )

Prior kernels are OK.

Larry