2016-10-26 22:47:40

by Mark Lord

[permalink] [raw]
Subject: [PATCH] drivers/net/usb/r8152 fix broken rx checksums

The r8152 driver has been broken since (approx) 3.6.16,
when support was added for hardware rx checksum on newer chip versions.
Symptoms include random segfaults and silent data corruption over NFS.

This does not work on the VER_02 dongle I have here
when used with a slow embedded system CPU.
Google reveals others reporting similar issues on Raspberry Pi.

So, disable hardware rx checksum for VER_02, and fix
an obvious coding error for IPV6 checksums in the same function.

Because this bug results in silent data corruption,
it is a good candidate for back-porting to -stable >= 3.16.xx.
Patch attached (to deal with buggy mailer) and also below for review.

Signed-off-by: Mark Lord <[email protected]>

--- old/drivers/net/usb/r8152.c 2016-09-30 04:20:43.000000000 -0400
+++ linux/drivers/net/usb/r8152.c 2016-10-26 14:15:44.932517676 -0400
@@ -1645,7 +1645,7 @@
u8 checksum = CHECKSUM_NONE;
u32 opts2, opts3;

- if (tp->version == RTL_VER_01)
+ if (tp->version == RTL_VER_01 || tp->version == RTL_VER_02)
goto return_result;

opts2 = le32_to_cpu(rx_desc->opts2);
@@ -1660,7 +1660,7 @@
checksum = CHECKSUM_NONE;
else
checksum = CHECKSUM_UNNECESSARY;
- } else if (RD_IPV6_CS) {
+ } else if (opts2 & RD_IPV6_CS) {
if ((opts2 & RD_UDP_CS) && !(opts3 & UDPF))
checksum = CHECKSUM_UNNECESSARY;
else if ((opts2 & RD_TCP_CS) && !(opts3 & TCPF))


Attachments:
drivers_net_usb_r8152_checksums.patch (1.28 kB)

2016-10-26 22:54:33

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] drivers/net/usb/r8152 fix broken rx checksums

On 16-10-26 06:36 PM, Mark Lord wrote:
> The r8152 driver has been broken since (approx) 3.6.16,

Correction: broken since 3.16.xx.

> when support was added for hardware rx checksum on newer chip versions.
> Symptoms include random segfaults and silent data corruption over NFS.
>
> This does not work on the VER_02 dongle I have here
> when used with a slow embedded system CPU.
> Google reveals others reporting similar issues on Raspberry Pi.
>
> So, disable hardware rx checksum for VER_02, and fix
> an obvious coding error for IPV6 checksums in the same function.
>
> Because this bug results in silent data corruption,
> it is a good candidate for back-porting to -stable >= 3.16.xx.
> Patch attached (to deal with buggy mailer) and also below for review.
>
> Signed-off-by: Mark Lord <[email protected]>
>
> --- old/drivers/net/usb/r8152.c 2016-09-30 04:20:43.000000000 -0400
> +++ linux/drivers/net/usb/r8152.c 2016-10-26 14:15:44.932517676 -0400
> @@ -1645,7 +1645,7 @@
> u8 checksum = CHECKSUM_NONE;
> u32 opts2, opts3;
>
> - if (tp->version == RTL_VER_01)
> + if (tp->version == RTL_VER_01 || tp->version == RTL_VER_02)
> goto return_result;
>
> opts2 = le32_to_cpu(rx_desc->opts2);
> @@ -1660,7 +1660,7 @@
> checksum = CHECKSUM_NONE;
> else
> checksum = CHECKSUM_UNNECESSARY;
> - } else if (RD_IPV6_CS) {
> + } else if (opts2 & RD_IPV6_CS) {
> if ((opts2 & RD_UDP_CS) && !(opts3 & UDPF))
> checksum = CHECKSUM_UNNECESSARY;
> else if ((opts2 & RD_TCP_CS) && !(opts3 & TCPF))


--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-10-30 21:23:03

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] drivers/net/usb/r8152 fix broken rx checksums

From: Mark Lord <[email protected]>
Date: Wed, 26 Oct 2016 18:36:57 -0400

> Patch attached (to deal with buggy mailer) and also below for review.

Please make your mailer work properly so that you can submit
patches properly which work inline, just like every other developer
does for the kernel.

Also please format your Subject line properly, it must be of the
form:

[PATCH net] r8152: Fix broken RX checksums.

The important parts are:

1) "[PATCH net]" This says that it is a patch, and that it
is targetting the 'net' GIT tree specifically.

2) "r8152: " This indicates the "subsystem" that the patch
specifically targets, in this case the r8152 driver. It
must end with a colon character then a space.

3) "Fix broken RX checksums." Commit header lines and commit
messages are proper English, therefore sentences should
begin with a capitalized letter and end with a period.

Thanks.

2016-10-30 23:28:33

by Mark Lord

[permalink] [raw]
Subject: [PATCH net] r8152: Fix broken RX checksums.

The r8152 driver has been broken since (approx) 3.16.xx
when support was added for hardware RX checksums
on newer chip versions. Symptoms include random
segfaults and silent data corruption over NFS.

The hardware checksum logig does not work on the VER_02
dongles I have here when used with a slow embedded system CPU.
Google reveals others reporting similar issues on Raspberry Pi.

So, disable hardware RX checksum support for VER_02, and fix
an obvious coding error for IPV6 checksums in the same function.

Because this bug results in silent data corruption,
it is a good candidate for back-porting to -stable >= 3.16.xx.

Signed-off-by: Mark Lord <[email protected]>

--- old/drivers/net/usb/r8152.c 2016-09-30 04:20:43.000000000 -0400
+++ linux/drivers/net/usb/r8152.c 2016-10-26 14:15:44.932517676 -0400
@@ -1645,7 +1645,7 @@
u8 checksum = CHECKSUM_NONE;
u32 opts2, opts3;

- if (tp->version == RTL_VER_01)
+ if (tp->version == RTL_VER_01 || tp->version == RTL_VER_02)
goto return_result;

opts2 = le32_to_cpu(rx_desc->opts2);
@@ -1660,7 +1660,7 @@
checksum = CHECKSUM_NONE;
else
checksum = CHECKSUM_UNNECESSARY;
- } else if (RD_IPV6_CS) {
+ } else if (opts2 & RD_IPV6_CS) {
if ((opts2 & RD_UDP_CS) && !(opts3 & UDPF))
checksum = CHECKSUM_UNNECESSARY;
else if ((opts2 & RD_TCP_CS) && !(opts3 & TCPF))

2016-10-30 23:48:36

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH] drivers/net/usb/r8152 fix broken rx checksums

On Sun, 2016-10-30 at 17:22 -0400, David Miller wrote:
>         3) "Fix broken RX checksums."  Commit header lines and commit
>            messages are proper English, therefore sentences should
>            begin with a capitalized letter and end with a period.

Commit messages should be proper English. But commit header lines
should not end with a period. The vast majority doesn't. Yes, I've just
checked.

How many newspaper headlines end with a period?

Thanks,


Paul Bolle

2016-10-31 00:58:03

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

From: Mark Lord <[email protected]>
Date: Sun, 30 Oct 2016 19:28:27 -0400

> The r8152 driver has been broken since (approx) 3.16.xx
> when support was added for hardware RX checksums
> on newer chip versions. Symptoms include random
> segfaults and silent data corruption over NFS.
>
> The hardware checksum logig does not work on the VER_02
> dongles I have here when used with a slow embedded system CPU.
> Google reveals others reporting similar issues on Raspberry Pi.
>
> So, disable hardware RX checksum support for VER_02, and fix
> an obvious coding error for IPV6 checksums in the same function.
>
> Because this bug results in silent data corruption,
> it is a good candidate for back-porting to -stable >= 3.16.xx.
>
> Signed-off-by: Mark Lord <[email protected]>

Applied and queued up for -stable, thanks.

2016-10-31 02:08:13

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-10-30 08:57 PM, David Miller wrote:
> From: Mark Lord <[email protected]>
> Date: Sun, 30 Oct 2016 19:28:27 -0400
>
>> The r8152 driver has been broken since (approx) 3.16.xx
>> when support was added for hardware RX checksums
>> on newer chip versions. Symptoms include random
>> segfaults and silent data corruption over NFS.
>>
>> The hardware checksum logig does not work on the VER_02
>> dongles I have here when used with a slow embedded system CPU.
>> Google reveals others reporting similar issues on Raspberry Pi.
>>
>> So, disable hardware RX checksum support for VER_02, and fix
>> an obvious coding error for IPV6 checksums in the same function.
>>
>> Because this bug results in silent data corruption,
>> it is a good candidate for back-porting to -stable >= 3.16.xx.
>>
>> Signed-off-by: Mark Lord <[email protected]>
>
> Applied and queued up for -stable, thanks.

Thanks. Now that this is taken care of, I do wonder if perhaps
RX checksums ought to be enabled at all for ANY versions of this chip?

My theory is that the checksums probably work okay most of the time,
except when the hardware RX buffer overflows.

In my case, and in the case of the Raspberry Pi, the receiving CPU
is quite a bit slower than mainstream x86, so it can quite easily
fall behind in emptying the RX buffer on the chip.
The only indication this has happened may be an incorrect RX checksum.

This is only a theory, but I otherwise have trouble explaining
why we are seeing invalid RX checksums -- direct cable connections
to a switch, shared only with the NFS server. No reason for it
to have bad RX checksums in the first place.

Should we just blanket disable RX checksums for all versions here
unless proven otherwise/safe?

Anyone out there know better?

Cheers
Mark

2016-10-31 03:53:56

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

From: Mark Lord <[email protected]>
Date: Sun, 30 Oct 2016 22:07:25 -0400

> On 16-10-30 08:57 PM, David Miller wrote:
>> From: Mark Lord <[email protected]>
>> Date: Sun, 30 Oct 2016 19:28:27 -0400
>>
>>> The r8152 driver has been broken since (approx) 3.16.xx
>>> when support was added for hardware RX checksums
>>> on newer chip versions. Symptoms include random
>>> segfaults and silent data corruption over NFS.
>>>
>>> The hardware checksum logig does not work on the VER_02
>>> dongles I have here when used with a slow embedded system CPU.
>>> Google reveals others reporting similar issues on Raspberry Pi.
>>>
>>> So, disable hardware RX checksum support for VER_02, and fix
>>> an obvious coding error for IPV6 checksums in the same function.
>>>
>>> Because this bug results in silent data corruption,
>>> it is a good candidate for back-porting to -stable >= 3.16.xx.
>>>
>>> Signed-off-by: Mark Lord <[email protected]>
>>
>> Applied and queued up for -stable, thanks.
>
> Thanks. Now that this is taken care of, I do wonder if perhaps
> RX checksums ought to be enabled at all for ANY versions of this chip?

You should really start a dialogue with the developer who has been
making the most, if not all, of the major changes to this driver over
the past few years, Hayes Wang.

2016-10-31 08:14:54

by Hayes Wang

[permalink] [raw]
Subject: RE: [PATCH net] r8152: Fix broken RX checksums.

> >>> The r8152 driver has been broken since (approx) 3.16.xx
> >>> when support was added for hardware RX checksums
> >>> on newer chip versions. Symptoms include random
> >>> segfaults and silent data corruption over NFS.
> >>>
> >>> The hardware checksum logig does not work on the VER_02
> >>> dongles I have here when used with a slow embedded system CPU.
> >>> Google reveals others reporting similar issues on Raspberry Pi.
> >>>
> >>> So, disable hardware RX checksum support for VER_02, and fix
> >>> an obvious coding error for IPV6 checksums in the same function.
> >>>
> >>> Because this bug results in silent data corruption,
> >>> it is a good candidate for back-porting to -stable >= 3.16.xx.
> >>>
> >>> Signed-off-by: Mark Lord <[email protected]>
> >>
> >> Applied and queued up for -stable, thanks.
> >
> > Thanks. Now that this is taken care of, I do wonder if perhaps
> > RX checksums ought to be enabled at all for ANY versions of this chip?
>
> You should really start a dialogue with the developer who has been
> making the most, if not all, of the major changes to this driver over
> the past few years, Hayes Wang.

Our hw engineer says only VER_01 has the issue about rx checksum.
I need more information for checking it.

Best Regards,
Hayes

2016-10-31 13:24:28

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-10-31 04:14 AM, Hayes Wang wrote:
>
> Our hw engineer says only VER_01 has the issue about rx checksum.
> I need more information for checking it.

I've been doing driver work for Linux since 1991,
and learned long ago not to trust engineering specs 100%.

Get yourself a Raspberry Pi v1, set up an NFSROOT root filesystem for it,
and boot/run from that using the ethernet dongle to connect.

It should segfault like crazy when hardware RX checksums are enabled.

Definitely something wrong there, and whatever it is goes away
when RX checksums are disabled in the driver.

I have two theories as to why this happens:

1) The hardware buffer on the dongle overflows because the slow host CPU
does not empty it quickly enough. This results in a bad checksum on the
final/truncated packet in the buffer. The chip does not detect this.

2) Perhaps the device driver is looking at the wrong bits.

Either way, this results in data corruption and until otherwise fixed,
it is safest to just not enable RX checksums.

If it happens on a slow embedded CPU, then it can also happen on a heavily
loaded Intel/AMD CPU -- just a lot less frequently.

Cheers
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-02 18:45:13

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-10-31 04:14 AM, Hayes Wang wrote:
>>>>> The r8152 driver has been broken since (approx) 3.16.xx
>>>>> when support was added for hardware RX checksums
>>>>> on newer chip versions. Symptoms include random
>>>>> segfaults and silent data corruption over NFS.
>>>>>
>>>>> The hardware checksum logig does not work on the VER_02
>>>>> dongles I have here when used with a slow embedded system CPU.
>>>>> Google reveals others reporting similar issues on Raspberry Pi.
...
> Our hw engineer says only VER_01 has the issue about rx checksum.
> I need more information for checking it.

I have poked at it some more, and thus far it appears that it is
only necessary to disable TCP rx checksums. The system doesn't crash
when only IP/UDP checksums are enabled, but does when TCP checksums are on.

This happens regardless of whether RX_AGG is disabled or enabled,
and increasing/decreasing the number of RX URBs (RTL8152_MAX_RX)
doesn't seem to affect it.

lsusb -vv (from an x86 system, not the failing embedded system) follows:

Bus 001 Device 004: ID 0bda:8152 Realtek Semiconductor Corp.
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.10
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
idVendor 0x0bda Realtek Semiconductor Corp.
idProduct 0x8152
bcdDevice 20.00
iManufacturer 1 Realtek
iProduct 2 USB 10/100 LAN
iSerial 3 84E71400257D
bNumConfigurations 2
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 39
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xa0
(Bus Powered)
Remote Wakeup
MaxPower 100mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 3
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 255 Vendor Specific Subclass
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 8
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 80
bNumInterfaces 2
bConfigurationValue 2
iConfiguration 0
bmAttributes 0xa0
(Bus Powered)
Remote Wakeup
MaxPower 100mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 2 Communications
bInterfaceSubClass 6 Ethernet Networking
bInterfaceProtocol 0
iInterface 5 CDC Communications Control
CDC Header:
bcdCDC 1.10
CDC Union:
bMasterInterface 0
bSlaveInterface 1
CDC Ethernet:
iMacAddress 3 84E71400257D
bmEthernetStatistics 0x00000000
wMaxSegmentSize 1514
wNumberMCFilters 0x0000
bNumberPowerFilters 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0010 1x 16 bytes
bInterval 8
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 1
bAlternateSetting 0
bNumEndpoints 0
bInterfaceClass 10 CDC Data
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 1
bAlternateSetting 1
bNumEndpoints 2
bInterfaceClass 10 CDC Data
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 4 Ethernet Data
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Binary Object Store Descriptor:
bLength 5
bDescriptorType 15
wTotalLength 12
bNumDeviceCaps 1
USB 2.0 Extension Device Capability:
bLength 7
bDescriptorType 16
bDevCapabilityType 2
bmAttributes 0x00000002
Link Power Management (LPM) Supported
Device Status: 0x0000
(Bus Powered)

2016-11-03 08:56:50

by Hayes Wang

[permalink] [raw]
Subject: RE: [PATCH net] r8152: Fix broken RX checksums.

Mark Lord [mailto:[email protected]]
> Sent: Thursday, November 03, 2016 2:30 AM
> To: Hayes Wang; David Miller
[...]
> I have poked at it some more, and thus far it appears that it is
> only necessary to disable TCP rx checksums. The system doesn't crash
> when only IP/UDP checksums are enabled, but does when TCP checksums are on.
>
> This happens regardless of whether RX_AGG is disabled or enabled,
> and increasing/decreasing the number of RX URBs (RTL8152_MAX_RX)
> doesn't seem to affect it.

I test Raspberry Pi v1, but I couldn't boot with NFSROOT through
both onboard nic and RTL8152. I get following error.

VFS: Unable to mount root fs via NFS, trying floppy.
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)

However, if I start the system without NFSROOT, I could mount the nfs fs.
Any idea?

NFS server: Fedora 24
Raspberry Pi OS: 2016-09-23-raspbian-jessie

Content of /etc/exports:

/nfsexport *(rw,sync,no_root_squash)

I change the cmdline.txt from

dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1
root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline
fsck.repair=yes rootwait quiet splash plymouth.ignore-serial-consoles

to

dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1
root=/dev/nfs nfsroot=192.168.94.2:/nfsexport ip=192.168.94.22
rw rootwait quiet splash plymouth.ignore-serial-consoles

Best Regards,
Hayes

2016-11-03 11:43:42

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-11-03 04:56 AM, Hayes Wang wrote:
> Mark Lord [mailto:[email protected]]
>> Sent: Thursday, November 03, 2016 2:30 AM
>> To: Hayes Wang; David Miller
> [...]
>> I have poked at it some more, and thus far it appears that it is
>> only necessary to disable TCP rx checksums. The system doesn't crash
>> when only IP/UDP checksums are enabled, but does when TCP checksums are on.
>>
>> This happens regardless of whether RX_AGG is disabled or enabled,
>> and increasing/decreasing the number of RX URBs (RTL8152_MAX_RX)
>> doesn't seem to affect it.
>
> I test Raspberry Pi v1, but I couldn't boot with NFSROOT through
> both onboard nic and RTL8152. I get following error.
>
> VFS: Unable to mount root fs via NFS, trying floppy.
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)
>
> However, if I start the system without NFSROOT, I could mount the nfs fs.
> Any idea?

Rather than getting caught up in all of that,
you could then just chroot to the mounted nfs fs
at that point, and continue on from there.

Eg. chroot /mnt/nfsxxx /bin/sh

Running from NFS is probably not necessary though.
Instead, perhaps just run md5sum on every file on the nfs fs
from the Raspberry Pi, and then repeat the md5sum's on the server,
and compare the results for errors.

The system I am using the dongle with is a custom embedded board,
but I think the important thing is that it has a slow-ish CPU,
which means it is more prone to having the on-chip RX FIFO overflow.
It is also big-endian rather than little-endian, though that seems
to be correctly handled already in the device driver.

I will try the md5sum test on an x86 box for comparison.

Cheers
--
Mark Lord

2016-11-04 12:14:14

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-11-02 02:29 PM, Mark Lord wrote:
>
> I have poked at it some more, and thus far it appears that it is
> only necessary to disable TCP rx checksums. The system doesn't crash
> when only IP/UDP checksums are enabled, but does when TCP checksums are on.
>
> This happens regardless of whether RX_AGG is disabled or enabled,
> and increasing/decreasing the number of RX URBs (RTL8152_MAX_RX)
> doesn't seem to affect it.
>
..

I noticed that BIT(20) was not defined as anything for "opts3",
and so I added a line to rx_csum to check whether or not that bit
was ever set.

It triggered after a few thousand reset/reboot cycles with opts3
having the rather dubious looking value of an ASCII string: 0x5c7d7852.

So to me, it appears that the rx_desc's are getting mixed-up with data somewhere,
and when using hardware TCP checksums this doesn't get caught. So perhaps the
checksums themselves are fine, but there's another bug (driver or hardware?)
that sneaks through when not doing software checksums.




2016-11-04 13:50:37

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

Yeah, the device or driver is definitely getting confused with rx_desc structures.
I added code to check for unlikely rx_desc values, and it found this for starters:

rx_desc: 00480801 00480401 00480001 0048fc00 0048f800 0048f400 pkt_len=2045
rx_data: 00 f0 48 00 00 ec 48 00 00 e8 48 00 00 e4 48 00 00 e0 48 00 00 dc 48 00 00 d8 48 00 00 d4 48 00
rx_data: 00 d0 48 00 00 cc 48 00 00 c8 48 00 00 c4 48 00 00 c0 48 00 00 bc 48 00 00 b8 48 00 00 b4 48 00
rx_data: 00 b0 48 00 00 ac 48 00 00 01 00 00 81 ed 00 00 00 01 00 00 00 00 00 00 00 00 00 02 4d ac 00 00
rx_data: 10 00 ff ff ff ff 00 00 01 28 83 d6 ff 6d 00 20 25 b1 58 1b 68 ff 00 05 20 01 56 41 17 35 00 00
...

The MTU/MRU on this link is the standard 1500 bytes, so a pkt_len of 2045 isn't valid here.
And the rx_desc values look an awful lot like the rx_data values that follow it.

There's definitely more broken here than just TCP RX checksums.

-ml

2016-11-04 20:24:05

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-11-04 09:50 AM, Mark Lord wrote:
> Yeah, the device or driver is definitely getting confused with rx_desc structures.
> I added code to check for unlikely rx_desc values, and it found this for starters:
>
> rx_desc: 00480801 00480401 00480001 0048fc00 0048f800 0048f400 pkt_len=2045
> rx_data: 00 f0 48 00 00 ec 48 00 00 e8 48 00 00 e4 48 00 00 e0 48 00 00 dc 48 00 00 d8 48 00 00 d4
> 48 00
> rx_data: 00 d0 48 00 00 cc 48 00 00 c8 48 00 00 c4 48 00 00 c0 48 00 00 bc 48 00 00 b8 48 00 00 b4
> 48 00
> rx_data: 00 b0 48 00 00 ac 48 00 00 01 00 00 81 ed 00 00 00 01 00 00 00 00 00 00 00 00 00 02 4d ac
> 00 00
> rx_data: 10 00 ff ff ff ff 00 00 01 28 83 d6 ff 6d 00 20 25 b1 58 1b 68 ff 00 05 20 01 56 41 17 35
> 00 00
> ...
>
> The MTU/MRU on this link is the standard 1500 bytes, so a pkt_len of 2045 isn't valid here.
> And the rx_desc values look an awful lot like the rx_data values that follow it.
>
> There's definitely more broken here than just TCP RX checksums.

I spent a bit more time on this again today, and made progress.
The issue seems to be stale rx buffers.
I'll discuss further offline with Hayes Wang.

--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-09 13:10:14

by Hayes Wang

[permalink] [raw]
Subject: RE: [PATCH net] r8152: Fix broken RX checksums.

Mark Lord [mailto:[email protected]]
> Sent: Friday, November 04, 2016 9:50 PM
[...]
> Yeah, the device or driver is definitely getting confused with rx_desc structures.
> I added code to check for unlikely rx_desc values, and it found this for starters:
>
> rx_desc: 00480801 00480401 00480001 0048fc00 0048f800 0048f400
> pkt_len=2045
> rx_data: 00 f0 48 00 00 ec 48 00 00 e8 48 00 00 e4 48 00 00 e0 48 00 00 dc 48 00
> 00 d8 48 00 00 d4 48 00
> rx_data: 00 d0 48 00 00 cc 48 00 00 c8 48 00 00 c4 48 00 00 c0 48 00 00 bc 48 00
> 00 b8 48 00 00 b4 48 00
> rx_data: 00 b0 48 00 00 ac 48 00 00 01 00 00 81 ed 00 00 00 01 00 00 00 00 00 00
> 00 00 00 02 4d ac 00 00
> rx_data: 10 00 ff ff ff ff 00 00 01 28 83 d6 ff 6d 00 20 25 b1 58 1b 68 ff 00 05 20 01
> 56 41 17 35 00 00
> ...
>
> The MTU/MRU on this link is the standard 1500 bytes, so a pkt_len of 2045 isn't
> valid here.
> And the rx_desc values look an awful lot like the rx_data values that follow it.
>
> There's definitely more broken here than just TCP RX checksums.

I don't think it is the issue of our hw. If it happens, windows or
other OS may have problems, too. It is like the memory issue described
in commit 990c9b347245("Merge branch 'r8152-fixes'"). It seems that
the data in memory is not same with the one from the device.

Besides, I test the raspberry pi with RTL8152. However, I don't find
any checksum issue for TCP. I try to copy a large file and md5sum it
through NFS. It works fine.

Best Regards,
Hayes


2016-11-09 13:19:37

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH net] r8152: Fix broken RX checksums.

On 16-11-09 08:09 AM, Hayes Wang wrote:
> Mark Lord [mailto:[email protected]]
..
>> The MTU/MRU on this link is the standard 1500 bytes, so a pkt_len of 2045 isn't
>> valid here.
>> And the rx_desc values look an awful lot like the rx_data values that follow it.
>>
>> There's definitely more broken here than just TCP RX checksums.
>
> I don't think it is the issue of our hw. If it happens, windows or
> other OS may have problems, too. It is like the memory issue described
> in commit 990c9b347245("Merge branch 'r8152-fixes'"). It seems that
> the data in memory is not same with the one from the device.

I am still doing long-term testing of various tweaks to the driver,
and can now confirm that changing from kmalloc() to usb_alloc_coherent()
vastly improves reliability, and re-enabling RX checksums works fine
with that change.

However, even with coherent URB buffers, I still see the occasional bad rx_desc:
like, twice in 36 hours of continuous bashing at it.

So having code in the driver to sanitize the rx_desc is essential.
My current test code (shared with Hayes already) includes validation of various
key fields of the rx_desc, and detects when the chip/driver/whatever gets confused.

Hopefully r8152.c will get updated to take more care before trusting
what it sees in the rx_desc fields.

Cheers
--
Mark Lord
[email protected]