2012-08-22 04:07:51

by Larry Finger

[permalink] [raw]
Subject: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

Hi,

The commit entitled "tcp: reduce out_of_order memory use" turns out to cause
problems with a number of USB drivers.

The first one called to my attention was for staging/r8712u. For this driver,
there are problems with SSL communications as reported in
https://bugzilla.redhat.com/show_bug.cgi?id=847525.

Other drivers including rtl8187, rtl8192cu and rt2800usb cannot connect to a
WPA1- or WEP-encrypted AP, but work fine with WPA2. The rtl8192cu problem is
reported in https://bugzilla.kernel.org/show_bug.cgi?id=46171.

I find it hard to understand why this patch should cause these effects; however,
I have verified that a kernel generated from "git checkout e86b2919" works fine.
This commit is immediately before the patch in question.

Note, this patch was applied during the merge between 3.3 and 3.4. The
regression has been undiscovered for some time.

Any help on this problem will be appreciated.

Larry


2012-08-27 20:39:32

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/27/2012 01:21 PM, Eric Dumazet wrote:
> On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:
>
>> I have prepared a patch to fix all the unchecked allocations.
>>
>> Over the weekend I made some progress. To test the latest vendor driver, I
>> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
>> I found that not only did the vendor driver work with secure sites, but so did
>> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
>> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
>> incompatibility when I did the conversion.
>>
>> Thanks for all your help in trying to resolve this issue.
>
> Interesting, so these 32/64 bits problems did not happen in prior
> kernels ? (< 3.4 )

Prior kernels are OK.

Larry


2012-08-24 16:58:24

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 11:23 AM, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
>> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
>>> On 08/24/2012 10:19 AM, David Miller wrote:
>>>>
>>>> This looks like full-on data corruption to me.
>>>
>>> I agree. The question is why does it happen with r8712u, and only after the
>>> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
>>> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
>>> no reports posted for this bug with any other device.
>>
>> bugs can sit unnoticed, and one change somewhere can uncover them.
>>
>> Really this driver must have a bug, if not half a dozen of bugs.
>>
>> For example this sequence of code is a clear bug :
>>
>> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
>> skb_reserve(sub_skb, 12);
>>
>>
>> Also the free_recv_skb_queue looks really suspect to me
>>
>> What the hell is doing recv_tasklet() I really wonder.
>>
>> This code, combined with the skb_clone() in recvbuf2recvframe()
>> can clearly reuse an skb passed to upper stacks.
>>
>>
>> queueing one skb in free_recv_skb_queue should be done
>> only if no clone of this skb exist somewhere.
>>
>> Please someone fix this buggy driver.
>>
>
> Try the following patch for a start
>
> diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
> index 8e82ce2..88e3ca6 100644
> --- a/drivers/staging/rtl8712/rtl8712_recv.c
> +++ b/drivers/staging/rtl8712/rtl8712_recv.c
> @@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
> recvbuf2recvframe(padapter, pskb);
> skb_reset_tail_pointer(pskb);
> pskb->len = 0;
> - skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> + if (!skb_cloned(pskb))
> + skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> + else
> + consume_skb(pskb);
> }
> }

This one did not help. There is no doubt it is needed for the case where memory
is tight, an allocation fails, and the driver clones the skb. In the present
case, debug statements have shown that the skb_clone() call was not made.

In the long term, this driver will be replaced with one that uses mac80211, but
in the short term, I am trying to fix it.

As I said earlier, my skb skills are minimal. Could you explain what is wrong
with the following sequence?

sub_skb = dev_alloc_skb(nSubframe_Length + 12);
skb_reserve(sub_skb, 12);

Thanks,

Larry



2012-08-24 14:10:02

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/23/2012 04:26 PM, Eric Dumazet wrote:
>
> skb->truesize is adjusted when a frag is added to one skb, or when
> skb->head is re-allocated.
>
> Are you sure you dont have another problem, because as I said commit
> c8628155ece3 had a bug, so a bisect is not very useful.
>
> How many reloads are needed to trigger the bug, do you have a script to
> reproduce it ?

I was doing the tests with a browser; however, it seems that a wget of that same
page is much more effective at triggering the problem. Now I have a script that
tries 60 passes of

wget https://bugzilla.redhat.com/show_bug.cgi?id=847525

I usually get failures on the first or second try. A kernel built from "git
checkout c862815" gets the following wget output:

============================================================================
--2012-08-23 21:50:32-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
connected.
OpenSSL: error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong tag
OpenSSL: error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested asn1 error
OpenSSL: error:0D08303A:asn1 encoding routines:ASN1_TEMPLATE_NOEXP_D2I:nested
asn1 error
OpenSSL: error:0D08303A:asn1 encoding routines:ASN1_TEMPLATE_NOEXP_D2I:nested
asn1 error
OpenSSL: error:1409000D:SSL routines:SSL3_GET_SERVER_CERTIFICATE:ASN1 lib
Unable to establish SSL connection.
============================================================================

With kernel 3.6-rc2, the error changes to the following:

============================================================================
--2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `show_bug.cgi?id=847525.3'

[ <=> ] 0 --.-K/s in 0s

2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
============================================================================

> Could it be a PMTU problem ? (check
> http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=9b04f350057863d1fad1ba071e09362a1da3503e )

I added the PMTU patch to kernel 3.6-rc2 - nothing changed.

Anything else to try?

Thanks,

Larry


2012-08-27 18:21:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:

> I have prepared a patch to fix all the unchecked allocations.
>
> Over the weekend I made some progress. To test the latest vendor driver, I
> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
> I found that not only did the vendor driver work with secure sites, but so did
> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
> incompatibility when I did the conversion.
>
> Thanks for all your help in trying to resolve this issue.

Interesting, so these 32/64 bits problems did not happen in prior
kernels ? (< 3.4 )



2012-08-22 04:26:40

by David Miller

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

From: Larry Finger <[email protected]>
Date: Tue, 21 Aug 2012 23:07:46 -0500

> I find it hard to understand why this patch should cause these
> effects; however, I have verified that a kernel generated from "git
> checkout e86b2919" works fine. This commit is immediately before the
> patch in question.

I can guarentee it's improper skb->truesize setting in these drivers,
or use of too huge buffer sizes compared to the amount of data
contained within the packet.

2012-08-22 21:33:26

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 12:15 AM, Eric Dumazet wrote:
>
> This particular commit is the start of a patches batch that ended in the
> generic TCP coalescing mechanism.
>
> It is known to have problem on drivers doing skb_clone() in their rx
> path.
>
> Current kernels should be ok, because coalescing doesnt happen if the
> destination skb is cloned (skb_cloned(to) in skb_try_coalesce())

The skb_clone() call is not the source of the problem for r8712u, as it is only
used when a memory allocation fails, which is not happening. The suggestion did
lead to another patch that I had been preparing. The initial allocation of RX
buffers used a size of 30720 bytes, while 9100 is sufficient to allow
aggregation. Upon reducing the buffer size, the driver now works for me. I am
now awaiting tests by the OP on the Red Hat bugzilla before sending the patch
upstream.

So far, no ideas for the problem in connecting to WPA1 networks.

Larry



2012-08-24 15:50:54

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 09:55 AM, Eric Dumazet wrote:
> Le vendredi 24 août 2012 à 09:09 -0500, Larry Finger a écrit :
>
>> With kernel 3.6-rc2, the error changes to the following:
>>
>> ============================================================================
>> --2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
>> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
>> Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
>> connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: unspecified [text/html]
>> Saving to: `show_bug.cgi?id=847525.3'
>>
>> [ <=> ] 0 --.-K/s in 0s
>>
>> 2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
>> routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
>> ============================================================================
>>
>
>> Anything else to try?
>
> When this happens, nothing in kernel log ?
>
> Could you do
>
> strace -o STRACE -s 8000 wget https://bugzilla.redhat.com/show_bug.cgi?id=847525
>
> and send the STRACE file when you get a failure ?

There is nothing in kernel log when it happens. The file STRACE is attached.

Larry



Attachments:
STRACE (380.75 kB)

2012-08-24 16:18:20

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> On 08/24/2012 10:19 AM, David Miller wrote:
> >
> > This looks like full-on data corruption to me.
>
> I agree. The question is why does it happen with r8712u, and only after the
> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> no reports posted for this bug with any other device.

bugs can sit unnoticed, and one change somewhere can uncover them.

Really this driver must have a bug, if not half a dozen of bugs.

For example this sequence of code is a clear bug :

sub_skb = dev_alloc_skb(nSubframe_Length + 12);
skb_reserve(sub_skb, 12);


Also the free_recv_skb_queue looks really suspect to me

What the hell is doing recv_tasklet() I really wonder.

This code, combined with the skb_clone() in recvbuf2recvframe()
can clearly reuse an skb passed to upper stacks.


queueing one skb in free_recv_skb_queue should be done
only if no clone of this skb exist somewhere.

Please someone fix this buggy driver.



2012-08-24 15:19:50

by David Miller

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

From: Larry Finger <[email protected]>
Date: Fri, 24 Aug 2012 09:09:57 -0500

> I usually get failures on the first or second try. A kernel built from
> "git checkout c862815" gets the following wget output:
>
> ============================================================================
> --2012-08-23 21:50:32--
> --https://bugzilla.redhat.com/show_bug.cgi?id=847525
> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
> Connecting to bugzilla.redhat.com
> (bugzilla.redhat.com)|209.132.183.69|:443... connected.
> OpenSSL: error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong
> tag
> OpenSSL: error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested
> asn1 error
> OpenSSL: error:0D08303A:asn1 encoding
> routines:ASN1_TEMPLATE_NOEXP_D2I:nested asn1 error
> OpenSSL: error:0D08303A:asn1 encoding
> routines:ASN1_TEMPLATE_NOEXP_D2I:nested asn1 error
> OpenSSL: error:1409000D:SSL routines:SSL3_GET_SERVER_CERTIFICATE:ASN1
> lib
> Unable to establish SSL connection.
> ============================================================================

This looks like full-on data corruption to me.

2012-08-24 16:23:58

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> > On 08/24/2012 10:19 AM, David Miller wrote:
> > >
> > > This looks like full-on data corruption to me.
> >
> > I agree. The question is why does it happen with r8712u, and only after the
> > commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> > have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> > no reports posted for this bug with any other device.
>
> bugs can sit unnoticed, and one change somewhere can uncover them.
>
> Really this driver must have a bug, if not half a dozen of bugs.
>
> For example this sequence of code is a clear bug :
>
> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> skb_reserve(sub_skb, 12);
>
>
> Also the free_recv_skb_queue looks really suspect to me
>
> What the hell is doing recv_tasklet() I really wonder.
>
> This code, combined with the skb_clone() in recvbuf2recvframe()
> can clearly reuse an skb passed to upper stacks.
>
>
> queueing one skb in free_recv_skb_queue should be done
> only if no clone of this skb exist somewhere.
>
> Please someone fix this buggy driver.
>

Try the following patch for a start

diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
index 8e82ce2..88e3ca6 100644
--- a/drivers/staging/rtl8712/rtl8712_recv.c
+++ b/drivers/staging/rtl8712/rtl8712_recv.c
@@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
recvbuf2recvframe(padapter, pskb);
skb_reset_tail_pointer(pskb);
pskb->len = 0;
- skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
+ if (!skb_cloned(pskb))
+ skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
+ else
+ consume_skb(pskb);
}
}



2012-08-24 16:01:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 10:49 -0500, Larry Finger wrote:

> There is nothing in kernel log when it happens. The file STRACE is attached.
>

So there is indeed a corruption. What was the driver you used in this
case ?



2012-08-23 21:26:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Thu, 2012-08-23 at 15:57 -0500, Larry Finger wrote:
> On 08/22/2012 11:03 PM, Eric Dumazet wrote:
> >
> > Changing the allocation size removes the problem ? thats really strange.
> >
> > If you try different sizes in the 9100-30720 range, can you pinpoint the
> > failure threshold ?
>
> The allocation size change did not fix the problem. It turned out that 10 tries
> from a secure web page were not enough to trigger this intermittent problem that
> particular test.
>
> Based on DaveM's comment that skb->truesize could be wrong, I tried setting
> truesize after every netdev_alloc_skb() call. Of course, that had no effect. I
> then found https://lkml.org/lkml/2010/11/19/505I, which clearly states why this
> need not be done.
>
> What skb modifications require that truesize be adjusted? The driver never
> resets skb->len or skb->data_len for any buffers, other than setting skb->len to
> zero.

skb->truesize is adjusted when a frag is added to one skb, or when
skb->head is re-allocated.

Are you sure you dont have another problem, because as I said commit
c8628155ece3 had a bug, so a bisect is not very useful.

How many reloads are needed to trigger the bug, do you have a script to
reproduce it ?

Could it be a PMTU problem ? (check
http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=9b04f350057863d1fad1ba071e09362a1da3503e )



2012-08-24 17:47:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Fri, 2012-08-24 at 11:58 -0500, Larry Finger wrote:
> On 08/24/2012 11:23 AM, Eric Dumazet wrote:
> > On Fri, 2012-08-24 at 18:18 +0200, Eric Dumazet wrote:
> >> On Fri, 2012-08-24 at 10:58 -0500, Larry Finger wrote:
> >>> On 08/24/2012 10:19 AM, David Miller wrote:
> >>>>
> >>>> This looks like full-on data corruption to me.
> >>>
> >>> I agree. The question is why does it happen with r8712u, and only after the
> >>> commit in the subject. Drivers for other devices that I have are OK. Thus far, I
> >>> have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
> >>> no reports posted for this bug with any other device.
> >>
> >> bugs can sit unnoticed, and one change somewhere can uncover them.
> >>
> >> Really this driver must have a bug, if not half a dozen of bugs.
> >>
> >> For example this sequence of code is a clear bug :
> >>
> >> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> >> skb_reserve(sub_skb, 12);
> >>
> >>
> >> Also the free_recv_skb_queue looks really suspect to me
> >>
> >> What the hell is doing recv_tasklet() I really wonder.
> >>
> >> This code, combined with the skb_clone() in recvbuf2recvframe()
> >> can clearly reuse an skb passed to upper stacks.
> >>
> >>
> >> queueing one skb in free_recv_skb_queue should be done
> >> only if no clone of this skb exist somewhere.
> >>
> >> Please someone fix this buggy driver.
> >>
> >
> > Try the following patch for a start
> >
> > diff --git a/drivers/staging/rtl8712/rtl8712_recv.c b/drivers/staging/rtl8712/rtl8712_recv.c
> > index 8e82ce2..88e3ca6 100644
> > --- a/drivers/staging/rtl8712/rtl8712_recv.c
> > +++ b/drivers/staging/rtl8712/rtl8712_recv.c
> > @@ -1127,6 +1127,9 @@ static void recv_tasklet(void *priv)
> > recvbuf2recvframe(padapter, pskb);
> > skb_reset_tail_pointer(pskb);
> > pskb->len = 0;
> > - skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> > + if (!skb_cloned(pskb))
> > + skb_queue_tail(&precvpriv->free_recv_skb_queue, pskb);
> > + else
> > + consume_skb(pskb);
> > }
> > }
>
> This one did not help. There is no doubt it is needed for the case where memory
> is tight, an allocation fails, and the driver clones the skb. In the present
> case, debug statements have shown that the skb_clone() call was not made.
>
> In the long term, this driver will be replaced with one that uses mac80211, but
> in the short term, I am trying to fix it.
>
> As I said earlier, my skb skills are minimal. Could you explain what is wrong
> with the following sequence?
>
> sub_skb = dev_alloc_skb(nSubframe_Length + 12);
> skb_reserve(sub_skb, 12);

dev_alloc_skb() can return NULL

-> crash

skb_clone() can also return NULL

-> crash




2012-08-24 14:55:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

Le vendredi 24 août 2012 à 09:09 -0500, Larry Finger a écrit :

> With kernel 3.6-rc2, the error changes to the following:
>
> ============================================================================
> --2012-08-24 08:26:42-- https://bugzilla.redhat.com/show_bug.cgi?id=847525
> Resolving bugzilla.redhat.com (bugzilla.redhat.com)... 209.132.183.69
> Connecting to bugzilla.redhat.com (bugzilla.redhat.com)|209.132.183.69|:443...
> connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `show_bug.cgi?id=847525.3'
>
> [ <=> ] 0 --.-K/s in 0s
>
> 2012-08-24 08:26:45 (0.00 B/s) - Read error at byte 0 (error:1408F119:SSL
> routines:SSL3_GET_RECORD:decryption failed or bad record mac).Retrying.
> ============================================================================
>

> Anything else to try?

When this happens, nothing in kernel log ?

Could you do

strace -o STRACE -s 8000 wget https://bugzilla.redhat.com/show_bug.cgi?id=847525

and send the STRACE file when you get a failure ?

Thanks




2012-08-22 16:00:18

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 12:15 AM, Eric Dumazet wrote:
>
> This particular commit is the start of a patches batch that ended in the
> generic TCP coalescing mechanism.
>
> It is known to have problem on drivers doing skb_clone() in their rx
> path.
>
> Current kernels should be ok, because coalescing doesnt happen if the
> destination skb is cloned (skb_cloned(to) in skb_try_coalesce())
>
> For 3.4 kernel, I guess I need to backport this skb_cloned(to) check fo
> stable 3.4 kernel
>
> But these skb_clone() in various USB drivers should be killed for good,
> they really can kill the box because of skb->truesize lies.

Applying your patch to 3.4 does not fix the problem. In addition, all of the
problems I noted happen with 3.6-rc2 - current kernels are not OK.

Of the drivers I reported, only r8712u in staging/rtl8712/rtl8712_recv.c has a
direct call to skb_clone(). As its symptoms are different, the cloning is likely
the source of the problems there. In addition, the problem is intermittent. The
driver normally allocates a new buffer and copies to it, and the clone only
occurs when that skb allocation fails. I will try to fix that. My skb skills are
minimal and I will have to do some studying.

Thanks,

Larry



2012-08-27 17:55:45

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 12:47 PM, Eric Dumazet wrote:
>
> dev_alloc_skb() can return NULL
>
> -> crash
>
> skb_clone() can also return NULL
>
> -> crash

I have prepared a patch to fix all the unchecked allocations.

Over the weekend I made some progress. To test the latest vendor driver, I
installed a 32-bit system. Their driver is not compatible with a 64-bit system.
I found that not only did the vendor driver work with secure sites, but so did
the in-kernel version. I now have tcpdump output for the 32-bit case that works,
and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
incompatibility when I did the conversion.

Thanks for all your help in trying to resolve this issue.

Larry



2012-08-23 20:57:14

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/22/2012 11:03 PM, Eric Dumazet wrote:
>
> Changing the allocation size removes the problem ? thats really strange.
>
> If you try different sizes in the 9100-30720 range, can you pinpoint the
> failure threshold ?

The allocation size change did not fix the problem. It turned out that 10 tries
from a secure web page were not enough to trigger this intermittent problem that
particular test.

Based on DaveM's comment that skb->truesize could be wrong, I tried setting
truesize after every netdev_alloc_skb() call. Of course, that had no effect. I
then found https://lkml.org/lkml/2010/11/19/505I, which clearly states why this
need not be done.

What skb modifications require that truesize be adjusted? The driver never
resets skb->len or skb->data_len for any buffers, other than setting skb->len to
zero.

Thanks,

Larry


2012-08-24 16:29:26

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 11:01 AM, Eric Dumazet wrote:
> On Fri, 2012-08-24 at 10:49 -0500, Larry Finger wrote:
>
>> There is nothing in kernel log when it happens. The file STRACE is attached.
>>
>
> So there is indeed a corruption. What was the driver you used in this
> case ?

The only driver that shows this problem is r8712u. One thought that I just had
is that this driver does not use mac80211, which is why it is in staging.
Instead it uses a Realtek soft MAC layer derived from the obsolete softmac
driver that was in the kernel in the 2.6.20 time frame. That means that none of
the skbs have ever been cloned by any part of the wireless code.

Larry



2012-08-24 15:58:51

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 08/24/2012 10:19 AM, David Miller wrote:
>
> This looks like full-on data corruption to me.

I agree. The question is why does it happen with r8712u, and only after the
commit in the subject. Drivers for other devices that I have are OK. Thus far, I
have tested b43, rtl8187, ath9k_htc, and rtl8192cu. To my knowledge, there are
no reports posted for this bug with any other device.

Larry





2012-08-22 05:15:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Tue, 2012-08-21 at 23:07 -0500, Larry Finger wrote:
> Hi,
>
> The commit entitled "tcp: reduce out_of_order memory use" turns out to cause
> problems with a number of USB drivers.
>
> The first one called to my attention was for staging/r8712u. For this driver,
> there are problems with SSL communications as reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=847525.
>
> Other drivers including rtl8187, rtl8192cu and rt2800usb cannot connect to a
> WPA1- or WEP-encrypted AP, but work fine with WPA2. The rtl8192cu problem is
> reported in https://bugzilla.kernel.org/show_bug.cgi?id=46171.
>
> I find it hard to understand why this patch should cause these effects; however,
> I have verified that a kernel generated from "git checkout e86b2919" works fine.
> This commit is immediately before the patch in question.
>
> Note, this patch was applied during the merge between 3.3 and 3.4. The
> regression has been undiscovered for some time.
>
> Any help on this problem will be appreciated.
>
> Larry

This particular commit is the start of a patches batch that ended in the
generic TCP coalescing mechanism.

It is known to have problem on drivers doing skb_clone() in their rx
path.

Current kernels should be ok, because coalescing doesnt happen if the
destination skb is cloned (skb_cloned(to) in skb_try_coalesce())

For 3.4 kernel, I guess I need to backport this skb_cloned(to) check fo
stable 3.4 kernel

But these skb_clone() in various USB drivers should be killed for good,
they really can kill the box because of skb->truesize lies.

Please test following patch (for 3.4 kernels)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 257b617..c45ac2d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4496,7 +4496,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
* to avoid future tcp_collapse_ofo_queue(),
* probably the most expensive function in tcp stack.
*/
- if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
+ if (skb->len <= skb_tailroom(skb1) &&
+ !skb_cloned(skb1) &&
+ !tcp_hdr(skb)->fin) {
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPRCVCOALESCE);
BUG_ON(skb_copy_bits(skb, 0,



2012-08-23 04:03:19

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Wed, 2012-08-22 at 16:33 -0500, Larry Finger wrote:
> On 08/22/2012 12:15 AM, Eric Dumazet wrote:
> >
> > This particular commit is the start of a patches batch that ended in the
> > generic TCP coalescing mechanism.
> >
> > It is known to have problem on drivers doing skb_clone() in their rx
> > path.
> >
> > Current kernels should be ok, because coalescing doesnt happen if the
> > destination skb is cloned (skb_cloned(to) in skb_try_coalesce())
>
> The skb_clone() call is not the source of the problem for r8712u, as it is only
> used when a memory allocation fails, which is not happening. The suggestion did
> lead to another patch that I had been preparing. The initial allocation of RX
> buffers used a size of 30720 bytes, while 9100 is sufficient to allow
> aggregation. Upon reducing the buffer size, the driver now works for me. I am
> now awaiting tests by the OP on the Red Hat bugzilla before sending the patch
> upstream.
>
> So far, no ideas for the problem in connecting to WPA1 networks.
>

Changing the allocation size removes the problem ? thats really strange.

If you try different sizes in the 9100-30720 range, can you pinpoint the
failure threshold ?




2012-09-10 18:34:59

by Larry Finger

[permalink] [raw]
Subject: Re: [PATCH] staging: r8712u: fix bug in r8712_recv_indicatepkt()

On 09/10/2012 12:55 PM, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> 64bit arches have a buggy r8712u driver, let's fix it.
>
> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> drivers/staging/rtl8712/recv_linux.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/drivers/staging/rtl8712/recv_linux.c b/drivers/staging/rtl8712/recv_linux.c
> index 0e26d5f..495ee12 100644
> --- a/drivers/staging/rtl8712/recv_linux.c
> +++ b/drivers/staging/rtl8712/recv_linux.c
> @@ -117,13 +117,8 @@ void r8712_recv_indicatepkt(struct _adapter *padapter,
> if (skb == NULL)
> goto _recv_indicatepkt_drop;
> skb->data = precv_frame->u.hdr.rx_data;
> -#ifdef NET_SKBUFF_DATA_USES_OFFSET
> - skb->tail = (sk_buff_data_t)(precv_frame->u.hdr.rx_tail -
> - precv_frame->u.hdr.rx_head);
> -#else
> - skb->tail = (sk_buff_data_t)precv_frame->u.hdr.rx_tail;
> -#endif
> skb->len = precv_frame->u.hdr.len;
> + skb_set_tail_pointer(skb, skb->len);
> if ((pattrib->tcpchk_valid == 1) && (pattrib->tcp_chkrpt == 1))
> skb->ip_summed = CHECKSUM_UNNECESSARY;
> else

Eric,

Thanks for this. It works for me. I had looked at this code a number of times,
but for some reason, I thought that the u.hdr.rx_yyyy parameters were like
32-bit systems and that all 4 values were pointers, and not like 64-bit systems,
thus the funny conversion. Strange that the bug was never triggered until commit
c8628155ece3 - this code has not changed since the driver was accepted into 2.6.37.

A few points on the patch. As the driver is in staging, the patch needs to go to
Greg Kroah-Hartman. In addition, please add the "Cc: Stable
<[email protected]>" line. You may also give an a s-o-b for me. Finally,
this patch should fix https://bugzilla.redhat.com/show_bug.cgi?id=847525, and
https://bugzilla.kernel.org/show_bug.cgi?id=45071.

Larry


2012-09-10 08:39:33

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:

> I have prepared a patch to fix all the unchecked allocations.
>
> Over the weekend I made some progress. To test the latest vendor driver, I
> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
> I found that not only did the vendor driver work with secure sites, but so did
> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
> incompatibility when I did the conversion.
>
> Thanks for all your help in trying to resolve this issue.
>
> Larry
>
>

Hi Larry

It appears I have a D-Link N300 (DWA-131) nano USB adapter, using
staging/rtl8712 driver.

I tried many kernel versions (including 3.3) and none seems to work
reliably.

Sometime, I have some traffic but only for about 50 frames...
It might be because my access point is a netgear wndr3800, because I
have following warning a bit before the freezes :

r8712u: [r8712_got_addbareq_event_callback] mac = 20:4e:7f:5a:cd:30, sea = 80, tid = 0

Thanks



2012-09-10 15:04:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On Mon, 2012-09-10 at 09:53 -0500, Larry Finger wrote:

> Eric,
>
> What is the md5sum for the firmware file /lib/firmware/rtlwifi/rtl8712u.bin?
> Over the weekend, there was a report of another device that had problems with
> firmware that was added to the linux-firmware repo in July with md5sum of
> c6f3b7b880aefb7b3f249428d659bdbb. An older version with md5sum of
> 200fd952db3cc9259b1fd05e3e51966f works in that case. I'll send you this one
> privately.
>
> I have a Netgear WNDR3300 and also get the addbareq events, but I do not get
> freezes. I'm not sure the message is correlated.
>

It seems I have the c6f3b7b880aefb7b3f249428d659bdbb one
# md5sum /data/src/linux-firmware/rtlwifi/rtl8712u.bin /lib/firmware/rtlwifi/rtl8712u.bin
c6f3b7b880aefb7b3f249428d659bdbb /data/src/linux-firmware/rtlwifi/rtl8712u.bin
c6f3b7b880aefb7b3f249428d659bdbb /lib/firmware/rtlwifi/rtl8712u.bin

Since I have linux-firmware tree, should I go back to initial commit
commit 8f919160792e4702c6b7a67a243cea4f757407e4
Author: Larry Finger <[email protected]>
Date: Mon Nov 1 23:56:52 2010 -0500

linux-firmware: Add firmware files for Realtek RTL8712U and
RTL8192CE

Thanks !




2012-09-10 17:55:24

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] staging: r8712u: fix bug in r8712_recv_indicatepkt()

From: Eric Dumazet <[email protected]>

64bit arches have a buggy r8712u driver, let's fix it.

Signed-off-by: Eric Dumazet <[email protected]>
---
drivers/staging/rtl8712/recv_linux.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/staging/rtl8712/recv_linux.c b/drivers/staging/rtl8712/recv_linux.c
index 0e26d5f..495ee12 100644
--- a/drivers/staging/rtl8712/recv_linux.c
+++ b/drivers/staging/rtl8712/recv_linux.c
@@ -117,13 +117,8 @@ void r8712_recv_indicatepkt(struct _adapter *padapter,
if (skb == NULL)
goto _recv_indicatepkt_drop;
skb->data = precv_frame->u.hdr.rx_data;
-#ifdef NET_SKBUFF_DATA_USES_OFFSET
- skb->tail = (sk_buff_data_t)(precv_frame->u.hdr.rx_tail -
- precv_frame->u.hdr.rx_head);
-#else
- skb->tail = (sk_buff_data_t)precv_frame->u.hdr.rx_tail;
-#endif
skb->len = precv_frame->u.hdr.len;
+ skb_set_tail_pointer(skb, skb->len);
if ((pattrib->tcpchk_valid == 1) && (pattrib->tcp_chkrpt == 1))
skb->ip_summed = CHECKSUM_UNNECESSARY;
else



2012-09-10 14:53:55

by Larry Finger

[permalink] [raw]
Subject: Re: Regression associated with commit c8628155ece3 - "tcp: reduce out_of_order memory use"

On 09/10/2012 03:39 AM, Eric Dumazet wrote:
> On Mon, 2012-08-27 at 12:55 -0500, Larry Finger wrote:
>
>> I have prepared a patch to fix all the unchecked allocations.
>>
>> Over the weekend I made some progress. To test the latest vendor driver, I
>> installed a 32-bit system. Their driver is not compatible with a 64-bit system.
>> I found that not only did the vendor driver work with secure sites, but so did
>> the in-kernel version. I now have tcpdump output for the 32-bit case that works,
>> and the 64-bit case that fails. It seems likely that I missed some 32/64 bit
>> incompatibility when I did the conversion.
>>
>> Thanks for all your help in trying to resolve this issue.
>>
>> Larry
>>
>>
>
> Hi Larry
>
> It appears I have a D-Link N300 (DWA-131) nano USB adapter, using
> staging/rtl8712 driver.
>
> I tried many kernel versions (including 3.3) and none seems to work
> reliably.
>
> Sometime, I have some traffic but only for about 50 frames...
> It might be because my access point is a netgear wndr3800, because I
> have following warning a bit before the freezes :
>
> r8712u: [r8712_got_addbareq_event_callback] mac = 20:4e:7f:5a:cd:30, sea = 80, tid = 0

Eric,

What is the md5sum for the firmware file /lib/firmware/rtlwifi/rtl8712u.bin?
Over the weekend, there was a report of another device that had problems with
firmware that was added to the linux-firmware repo in July with md5sum of
c6f3b7b880aefb7b3f249428d659bdbb. An older version with md5sum of
200fd952db3cc9259b1fd05e3e51966f works in that case. I'll send you this one
privately.

I have a Netgear WNDR3300 and also get the addbareq events, but I do not get
freezes. I'm not sure the message is correlated.

Larry