2016-11-11 07:16:11

by Hayes Wang

Subject: [PATCH net 0/2] r8152: rx patches

Make the rx sw checksum available and add some checks for the rx descriptor.

Hayes Wang (2):
r8152: fix the sw rx checksum is unavailable
r8152: rx descriptor check

drivers/net/usb/r8152.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 45 insertions(+), 1 deletion(-)

--
2.7.4


2016-11-11 07:16:21

by Hayes Wang

Subject: [PATCH net 2/2] r8152: rx descriptor check

On some platforms, the data in memory is not the same as the data
from the device. That is, the data in memory cannot be believed. The
check is used to detect this situation.

Signed-off-by: Hayes Wang <[email protected]>
---
drivers/net/usb/r8152.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 0e42a78..e766121 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1756,6 +1756,43 @@ static u8 r8152_rx_csum(struct r8152 *tp, struct rx_desc *rx_desc)
 	return checksum;
 }

+static int invalid_rx_desc(struct r8152 *tp, struct rx_desc *rx_desc)
+{
+	u32 opts1 = le32_to_cpu(rx_desc->opts1);
+	u32 opts2 = le32_to_cpu(rx_desc->opts2);
+	unsigned int pkt_len = opts1 & RX_LEN_MASK;
+
+	switch (tp->version) {
+	case RTL_VER_01:
+	case RTL_VER_02:
+		if (pkt_len > RTL8152_RMS)
+			return -EIO;
+		break;
+	default:
+		if (pkt_len > RTL8153_RMS)
+			return -EIO;
+		break;
+	}
+
+	switch (opts2 & (RD_IPV4_CS | RD_IPV6_CS)) {
+	case (RD_IPV4_CS | RD_IPV6_CS):
+		return -EIO;
+	case RD_IPV4_CS:
+	case RD_IPV6_CS:
+		switch (opts2 & (RD_UDP_CS | RD_TCP_CS)) {
+		case (RD_UDP_CS | RD_TCP_CS):
+			return -EIO;
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
 static int rx_bottom(struct r8152 *tp, int budget)
 {
 	unsigned long flags;
@@ -1812,6 +1849,18 @@ static int rx_bottom(struct r8152 *tp, int budget)
 			unsigned int pkt_len;
 			struct sk_buff *skb;

+			if (unlikely(invalid_rx_desc(tp, rx_desc))) {
+				if (net_ratelimit())
+					netif_err(tp, rx_err, netdev,
+						  "Memory unbelievable\n");
+				if (tp->netdev->features & NETIF_F_RXCSUM) {
+					tp->netdev->features &= ~NETIF_F_RXCSUM;
+					netif_err(tp, rx_err, netdev,
+						  "rx checksum off\n");
+				}
+				break;
+			}
+
 			pkt_len = le32_to_cpu(rx_desc->opts1) & RX_LEN_MASK;
 			if (pkt_len < ETH_ZLEN)
 				break;
--
2.7.4

2016-11-11 07:16:19

by Hayes Wang

Subject: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Fix the problem that the hw rx checksum is always enabled and the user
cannot switch to the sw rx checksum.

Note that RTL_VER_01 supports sw rx checksum only. Besides,
the hw rx checksum for RTL_VER_02 was disabled by
commit b9a321b48af4 ("r8152: Fix broken RX checksums."). Re-enable it.

Signed-off-by: Hayes Wang <[email protected]>
---
drivers/net/usb/r8152.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 75c5168..0e42a78 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1730,7 +1730,7 @@ static u8 r8152_rx_csum(struct r8152 *tp, struct rx_desc *rx_desc)
 	u8 checksum = CHECKSUM_NONE;
 	u32 opts2, opts3;

-	if (tp->version == RTL_VER_01 || tp->version == RTL_VER_02)
+	if (!(tp->netdev->features & NETIF_F_RXCSUM))
 		goto return_result;

 	opts2 = le32_to_cpu(rx_desc->opts2);
@@ -4307,6 +4307,11 @@ static int rtl8152_probe(struct usb_interface *intf,
 			      NETIF_F_HIGHDMA | NETIF_F_FRAGLIST |
 			      NETIF_F_IPV6_CSUM | NETIF_F_TSO6;

+	if (tp->version == RTL_VER_01) {
+		netdev->features &= ~NETIF_F_RXCSUM;
+		netdev->hw_features &= ~NETIF_F_RXCSUM;
+	}
+
 	netdev->ethtool_ops = &ops;
 	netif_set_gso_max_size(netdev, RTL_LIMITED_TSO_SIZE);

--
2.7.4

2016-11-11 12:13:23

by Francois Romieu

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

Hayes Wang <[email protected]> :
> For some platforms, the data in memory is not the same with the one
> from the device. That is, the data of memory is unbelievable. The
> check is used to find out this situation.

Invalid packet sizes and corrupted receive descriptors in a Realtek
device: this is reminiscent of CVE-2009-4537.

Is the silicon of both devices different enough to prevent the same
exploit from happening?

--
Ueimor

2016-11-12 13:21:31

by Mark Lord

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

On 16-11-11 07:13 AM, Francois Romieu wrote:
> Hayes Wang <[email protected]> :
>> For some platforms, the data in memory is not the same with the one
>> from the device. That is, the data of memory is unbelievable. The
>> check is used to find out this situation.
>
> Invalid packet size corrupted receive descriptors in Realtek's device
> reminds of CVE-2009-4537.
>
> Is the silicon of both devices different enough to prevent the same
> exploit to happen ?

I don't know if the hardware can do it, but the existing Linux device
driver regularly attempts to process huge unreal packet sizes here.
I've had to patch it to reject "packets" larger than the configured MRU.
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-13 17:39:58

by David Miller

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

From: Hayes Wang <[email protected]>
Date: Fri, 11 Nov 2016 15:15:41 +0800

> For some platforms, the data in memory is not the same with the one
> from the device. That is, the data of memory is unbelievable. The
> check is used to find out this situation.
>
> Signed-off-by: Hayes Wang <[email protected]>

I'm all for adding consistency checks, but I disagree with proceeding
in this manner for this.

If you add this patch now, there is a much smaller likelihood that you
will work with a high priority to figure out _why_ this is happening.

For all we know this could be a platform bug in the DMA API for the
systems in question.

It could also be a bug elsewhere in the driver, either in setting up
the descriptor DMA mappings or how the chip is programmed.

Either way the true cause must be found before we start throwing
changes like this into the driver.

I'm not applying this series, sorry.

2016-11-13 20:34:13

by Mark Lord

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

On 16-11-13 12:39 PM, David Miller wrote:
> From: Hayes Wang <[email protected]>
> Date: Fri, 11 Nov 2016 15:15:41 +0800
>
>> For some platforms, the data in memory is not the same with the one
>> from the device. That is, the data of memory is unbelievable. The
>> check is used to find out this situation.
>>
>> Signed-off-by: Hayes Wang <[email protected]>
>
> I'm all for adding consistency checks, but I disagree with proceeding
> in this manner for this.
>
> If you add this patch now, there is a much smaller likelihood that you
> will work with a high priority to figure out _why_ this is happening.
>
> For all we know this could be a platform bug in the DMA API for the
> systems in question.
>
> It could also be a bug elsewhere in the driver, either in setting up
> the descriptor DMA mappings or how the chip is programmed.
>
> Either way the true cause must be found before we start throwing
> changes like this into the driver.

I agree.

The system I use it with is a 32-bit ppc476, with non-coherent RAM,
and using 16KB page sizes.

The dongle instantly becomes a lot more reliable when r8152.c is updated
to use usb_alloc_coherent() for URB buffers, rather than kmalloc().

Not sure why that would be though, as the USB stack normally would handle
kmalloc'd buffers just fine. It is calling the appropriate routines,
which boil down to invalidating the dcache lines (for inbound bulk xfers)
as part of usb_submit_urb(), and yet the problem there persists.

It could be caused by cache-line sharing with other allocations, but that seems
unlikely as the kmalloc() size is 16384 bytes per buffer. Perhaps the driver
is somehow accessing the buffer space again after doing usb_submit_urb()?
That would certainly produce this kind of behaviour.

Or maybe there's just a memory barrier missing somewhere in the path.

The really weird thing is that ASIX-based dongles (which use a different driver)
don't have this problem, and yet they also use kmalloc'd buffers.

I have access to the test system only for a day or two a week,
and it takes a few hours to do a good test as to whether something helps or not.
I'll continue to poke at it as time and New Ideas permit.

New Ideas welcome!
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-13 20:38:33

by Mark Lord

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

On 16-11-13 03:34 PM, Mark Lord wrote:
>
> The system I use it with is a 32-bit ppc476, with non-coherent RAM,
> and using 16KB page sizes.
>
> The dongle instantly becomes a lot more reliable when r8152.c is updated
> to use usb_alloc_coherent() for URB buffers, rather than kmalloc().
>
> Not sure why that would be though, as the USB stack normally would handle
> kmalloc'd buffers just fine. It is calling the appropriate routines,
> which boil down to invalidating the dcache lines (for inbound bulk xfers)
> as part of usb_submit_urb(), and yet the problem there persists.
>
> It could be caused by cache-line sharing with other allocations, but that seems
> unlikely as the kmalloc() size is 16384 bytes per buffer. Perhaps the driver
> is somehow accessing the buffer space again after doing usb_submit_urb()?
> That would certainly produce this kind of behaviour.
>
> Or maybe there's just a memory barrier missing somewhere in path.
>
> The really weird thing is that ASIX-based dongles (which use a different driver)
> don't have this problem, and yet they also use kmalloc'd buffers.
>
> I have access to the test system only for a day or two a week,
> and it takes a few hours to do a good test as to whether something helps or not.
> I'll continue to poke at it as time and New Ideas permit.

Oh, and the problems did not exist with the 3.14.xx kernels and earlier.
They began to show up when we tried 3.16.xx and all newer kernels.

The difference there is that RX checksums were enabled in hardware as of 3.16.xx,
and thus the network stack began accepting bad packets from the r8152 driver.

I don't know if the ASIX driver uses hardware checksums or just software checksums.
That might explain why it is more reliable here.
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-14 06:44:11

by Hayes Wang

Subject: RE: [PATCH net 2/2] r8152: rx descriptor check

Francois Romieu [mailto:[email protected]]
> Sent: Friday, November 11, 2016 8:13 PM
[...]
> Invalid packet size corrupted receive descriptors in Realtek's device
> reminds of CVE-2009-4537.

Do you mean that the driver would get a packet that exceeds the size
set in RxMaxSize? I checked with our hw engineers. They have not
seen any issue with RxMaxSize, and their tests of the RxMaxSize
register pass.

> Is the silicon of both devices different enough to prevent the same
> exploit to happen ?

In this case, I don't think the device provides an invalid value
for the receive descriptors. However, the driver sees a different
value. That is why I say the memory is unbelievable.

Best Regards,
Hayes

2016-11-14 07:04:17

by Hayes Wang

Subject: RE: [PATCH net 2/2] r8152: rx descriptor check

David Miller [mailto:[email protected]]
> Sent: Monday, November 14, 2016 1:40 AM
[...]
> If you add this patch now, there is a much smaller likelihood that you
> will work with a high priority to figure out _why_ this is happening.
>
> For all we know this could be a platform bug in the DMA API for the
> systems in question.
>
> It could also be a bug elsewhere in the driver, either in setting up
> the descriptor DMA mappings or how the chip is programmed.
>
> Either way the true cause must be found before we start throwing
> changes like this into the driver.

Our hw engineers can check our device, and I can check the
driver. However, for the other parts, such as the USB host
controller or the memory, it is difficult for me to verify whether
they are correct. I can only promise that our devices and
driver work fine.

Best Regards,
Hayes

2016-11-14 07:24:13

by Hayes Wang

Subject: RE: [PATCH net 2/2] r8152: rx descriptor check

Mark Lord [mailto:[email protected]]
> Sent: Monday, November 14, 2016 4:34 AM
[...]
> Perhaps the driver
> is somehow accessing the buffer space again after doing usb_submit_urb()?
> That would certainly produce this kind of behaviour.

I don't think so. First, the driver only reads the received buffer.
That is, the driver does not change (or write) the data. Second,
the driver loses the pointer to the received buffer
after submitting the urb to the USB host controller, until the
transfer is completed by the USB host controller. That is, the
driver has no way to access the buffer after calling usb_submit_urb().

Best Regards,
Hayes


2016-11-14 17:27:52

by David Miller

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

From: Hayes Wang <[email protected]>
Date: Mon, 14 Nov 2016 07:23:51 +0000

> Mark Lord [mailto:[email protected]]
>> Sent: Monday, November 14, 2016 4:34 AM
> [...]
>> Perhaps the driver
>> is somehow accessing the buffer space again after doing usb_submit_urb()?
>> That would certainly produce this kind of behaviour.
>
> I don't think so. First, the driver only read the received buffer.
> That is, the driver would not change (or write) the data. Second,
> The driver would lose the point address of the received buffer
> after submitting the urb to the USB host controller, until the
> transfer is completed by the USB host controller. That is, the
> driver doesn't know how to access the buffer after calling usb_submit_urb().

This is why it's most likely some DMA implementation issue or similar.

2016-11-15 01:10:57

by Francois Romieu

Subject: Re: [PATCH net 2/2] r8152: rx descriptor check

Hayes Wang <[email protected]> :
> Francois Romieu [mailto:[email protected]]
> > Sent: Friday, November 11, 2016 8:13 PM
> [...]
> > Invalid packet size corrupted receive descriptors in Realtek's device
> > reminds of CVE-2009-4537.
>
> Do you mean that the driver would get a packet exceed the size
> which is set to RxMaxSize ?

If it was possible to get it wrong once, it should be possible to
get it wrong twice, especially if some part of the hardware design
is recycled. I don't mean anything else.

I won't speculate about some cache consistency issue or some badly
aborted dma transaction to explain the memory corruption.

--
Ueimor

2016-11-17 03:05:31

by Hayes Wang

Subject: RE: [PATCH net 2/2] r8152: rx descriptor check

Francois Romieu [mailto:[email protected]]
> Sent: Tuesday, November 15, 2016 9:11 AM
[...]
> If it was possible to get it wrong once, it should be possible to
> get it wrong twice, especially if some part of the hardware design
> is recycled. I don't mean anything else.

I agree with you. However, I have to be able to reproduce it
in order to confirm it.

Besides, the behavior is different for PCIe and USB devices. A USB
device performs no DMA itself. That is done by the USB host
controller. And the USB host controller wouldn't allow the device
to send more data than the buffer can hold.

Best Regards,
Hayes

2016-11-17 03:36:17

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

[...]
> Fix the hw rx checksum is always enabled, and the user couldn't switch
> it to sw rx checksum.
>
> Note that the RTL_VER_01 only supports sw rx checksum only. Besides,
> the hw rx checksum for RTL_VER_02 is disabled after
> commit b9a321b48af4 ("r8152: Fix broken RX checksums."). Re-enable it.

Excuse me. If I want to re-send this one patch, should I let
RTL_VER_02 use rx hw checksum?

Best Regards,
Hayes

2016-11-17 14:29:56

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

(resending.. not sure if the original had mailer errors)

On 16-11-16 10:36 PM, Hayes Wang wrote:
> [...]
>> Fix the hw rx checksum is always enabled, and the user couldn't switch
>> it to sw rx checksum.
>>
>> Note that the RTL_VER_01 only supports sw rx checksum only. Besides,
>> the hw rx checksum for RTL_VER_02 is disabled after
>> commit b9a321b48af4 ("r8152: Fix broken RX checksums."). Re-enable it.
>
> Excuse me. If I want to re-send this one patch, should I let
> RTL_VER_02 use rx hw checksum?

Definitely NOT.

I am still doing low-level tracing through the driver as time permits,
and just now found some really interesting evidence.

Using coherent buffers (non-cacheable, allocated with usb_alloc_coherent),
I can get it to fail extremely regularly by simply reducing the buffer size
(agg_buf_sz) from 16KB down to 4KB. This makes reproducing the issue
much much easier -- the same problems do happen with the larger 16KB size,
but much less often than with smaller sizes.

So.. with a 4KB URB transfer_buffer size, along with a ton of added error-checking,
I see this behaviour every 10 (rx) URBs or so:

First URB (number 593):
[ 34.260667] r8152_rx_bottom: 593 corrupted urb: head=bf014000 urb_offset=2856/4096 pkt_len(1518) exceeds remainder(1216)
[ 34.271931] r8152_dump_rx_desc: 044805ee 40080000 006005dc 06020000 00000000 00000000 rx_len=1518

Next URB (number 594):
[ 34.281172] r8152_check_rx_desc: rx_desc looks bad.
[ 34.286228] r8152_rx_bottom: 594 corrupted urb. head=bf018000 urb_offset=0/304 len_used=24
[ 34.294774] r8152_dump_rx_desc: 00008300 00008400 00008500 00008600 00008700 00008800 rx_len=768

What the above sample shows, is the URB transfer buffer ran out of space in the middle
of a packet, and the hardware then tried to just continue that same packet in the next URB,
without an rx_desc header inserted. The r8152.c driver always assumes the URB buffer begins
with an rx_desc, so of course this behaviour produces really weird effects, and system crashes, etc..
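
The figures in the two log excerpts above are self-consistent, which can be checked with a short userspace sketch. It assumes that RX_LEN_MASK is 0x7fff (the low 15 bits of opts1, which matches the rx_len values in the dumps), that an rx_desc is 24 bytes (six 32-bit words, as dumped), and that the second number in urb_offset=A/B is the actual length of the URB. The helper names are made up for illustration and are not part of r8152.c.

```c
#include <stdint.h>

#define RX_LEN_MASK  0x7fff /* assumed: low 15 bits of opts1 hold the length */
#define RX_DESC_SIZE 24     /* assumed: six 32-bit words per descriptor */

/* Packet length (CRC included) announced by a descriptor's opts1 word. */
static unsigned int rx_len(uint32_t opts1)
{
	return opts1 & RX_LEN_MASK;
}

/* Bytes left for packet data behind a descriptor that starts at urb_offset. */
static unsigned int rx_remainder(unsigned int actual_length,
				 unsigned int urb_offset)
{
	if (actual_length < urb_offset + RX_DESC_SIZE)
		return 0;
	return actual_length - urb_offset - RX_DESC_SIZE;
}

/* Round a length up to the 8-byte alignment used between packets. */
static unsigned int align8(unsigned int n)
{
	return (n + 7u) & ~7u;
}
```

Under these assumptions, rx_len(0x044805ee) is 1518 and rx_remainder(4096, 2856) is 1216, as logged for URB 593. The missing 1518 - 1216 = 302 bytes round up to 304 with 8-byte padding, which matches the 304 reported for URB 594.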

So until that driver bug is addressed, I would advise disabling hardware RX checksums
for all chip versions, not only for version 02.

It is not clear to me how the chip decides when to forward an rx URB to the host.
If you could describe how that part works for us, then it would help in further
understanding why fast systems (eg. a PC) don't generally notice the issue,
while much slower embedded systems do see the issue regularly.

Thanks
Mark

2016-11-17 14:29:41

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-17 09:14 AM, Mark Lord wrote:
..
> Using coherent buffers (non-cacheable, allocated with usb_alloc_coherent),

Note that the same behaviour also happens with the original kmalloc'd buffers.

> I can get it to fail extremely regularly by simply reducing the buffer size
> (agg_buf_sz) from 16KB down to 4KB. This makes reproducing the issue
> much much easier -- the same problems do happen with the larger 16KB size,
> but much less often than with smaller sizes.

Increasing the buffer size to 64KB makes the problem much less frequent,
as one might expect. Thus far I haven't seen it happen at all, but a longer
run (1-3 days) is needed to make sure. This however is NOT a "fix".

> So.. with a 4KB URB transfer_buffer size, along with a ton of added error-checking,
> I see this behaviour every 10 (rx) URBs or so:
>
> First URB (number 593):
> [ 34.260667] r8152_rx_bottom: 593 corrupted urb: head=bf014000 urb_offset=2856/4096 pkt_len(1518) exceeds remainder(1216)
> [ 34.271931] r8152_dump_rx_desc: 044805ee 40080000 006005dc 06020000 00000000 00000000 rx_len=1518
>
> Next URB (number 594):
> [ 34.281172] r8152_check_rx_desc: rx_desc looks bad.
> [ 34.286228] r8152_rx_bottom: 594 corrupted urb. head=bf018000 urb_offset=0/304 len_used=24
> [ 34.294774] r8152_dump_rx_desc: 00008300 00008400 00008500 00008600 00008700 00008800 rx_len=768
>
> What the above sample shows, is the URB transfer buffer ran out of space in the middle
> of a packet, and the hardware then tried to just continue that same packet in the next URB,
> without an rx_desc header inserted. The r8152.c driver always assumes the URB buffer begins
> with an rx_desc, so of course this behaviour produces really weird effects, and system crashes, etc..
>
> So until that driver bug is addressed, I would advise disabling hardware RX checksums
> for all chip versions, not only for version 02.
>
> It is not clear to me how the chip decides when to forward an rx URB to the host.
> If you could describe how that part works for us, then it would help in further
> understanding why fast systems (eg. a PC) don't generally notice the issue,
> while much slower embedded systems do see the issue regularly.

That last part is critical to understanding things:
How does the chip decide that a URB is "full enough" before sending it to the host?
Why does a really fast host see fewer packets jammed together into a single URB than a slower host?

The answers will help understand if there are more bugs to be found/fixed,
or if everything is explained by what has been observed thus far.

To recap: the hardware sometimes fills a URB to the very end, and then continues the
current packet at the first byte of the following URB. The r8152.c driver does NOT
handle this situation; instead it always interprets the first 24 bytes of every URB
as an "rx_desc" structure, without any kind of sanity/validation. This results in
buffer overruns (it trusts the packet length field, even though the URB is too small
to hold such a packet), and other semi-random behaviour.

Using software rx checksums prevents Bad Things(tm) happening from most of this,
but even that is not perfect given the severity of the bug.

Cheers

2016-11-18 07:57:54

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Thursday, November 17, 2016 9:42 PM
[...]
> What the above sample shows, is the URB transfer buffer ran out of space in the
> middle
> of a packet, and the hardware then tried to just continue that same packet in the
> next URB,
> without an rx_desc header inserted. The r8152.c driver always assumes the URB
> buffer begins
> with an rx_desc, so of course this behaviour produces really weird effects, and
> system crashes, etc..

The USB device doesn't know the address or size of the buffer. Only
the USB host controller knows. Therefore, the device sends the
data to the host, and the host fills the memory. According to your
description, it seems the host splits the data from the device
into two different buffers (or URB transfers). I wonder whether that
could happen. As far as I know, the host wouldn't allow the buffer
size to be less than the data length.

Our hw engineers need a log from a USB analyzer to confirm
what the device sends to the host. However, I don't think you
have a USB analyzer to do this. I will try to reproduce the issue.
But I am busy, so I don't think I can respond quickly.

Besides, the maximum data length which the RTL8152 would send to
the host is 16KB. That is, if agg_buf_sz is 16KB, the host
wouldn't split it. However, you still see problems with it.

[...]
> It is not clear to me how the chip decides when to forward an rx URB to the host.
> If you could describe how that part works for us, then it would help in further
> understanding why fast systems (eg. a PC) don't generally notice the issue,
> while much slower embedded systems do see the issue regularly.

The driver expects the rx buffer to be laid out as

rx_desc + a packet + padding to 8 alignment +
rx_desc + a packet + padding to 8 alignment + ...

Therefore, when a urb transfer is completed, the driver parses
the buffer this way. After the buffer is handled, it is
submitted to the host again, until the transfer completes again.
If the submission fails, the driver tries again later.
urb->actual_length indicates how much data the host filled in. The
driver uses it to find the end of the data. urb->status indicates
whether the transfer was successful. The driver resubmits the urb
to the host directly if the status is not successful.
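
The layout described above can be modelled in userspace as a bounds-checked walk over one completed buffer. This is only a sketch: the 24-byte descriptor size, the 0x7fff length mask and the 60-byte minimum (ETH_ZLEN) are assumptions taken from the logs and the driver code quoted in this thread, and walk_rx_buffer() is a hypothetical name, not a function in r8152.c.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_LEN_MASK  0x7fff /* assumed length mask for opts1 */
#define RX_DESC_SIZE 24     /* assumed descriptor size in bytes */
#define RX_ALIGN     8      /* padding between packets */
#define MIN_PKT_LEN  60     /* ETH_ZLEN */

/* Read a little-endian 32-bit word regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static size_t align_up(size_t n)
{
	return (n + RX_ALIGN - 1) & ~(size_t)(RX_ALIGN - 1);
}

/*
 * Count the complete packets in buf[0..actual_length).
 * *truncated is set to 1 when a descriptor announces more data
 * than the buffer still holds; nothing past actual_length is read.
 */
static int walk_rx_buffer(const uint8_t *buf, size_t actual_length,
			  int *truncated)
{
	size_t off = 0;
	int count = 0;

	*truncated = 0;
	/* A descriptor is only read if it lies fully inside the data. */
	while (off + RX_DESC_SIZE < actual_length) {
		size_t pkt_len = get_le32(buf + off) & RX_LEN_MASK;

		if (pkt_len < MIN_PKT_LEN)
			break;
		if (off + RX_DESC_SIZE + pkt_len > actual_length) {
			*truncated = 1;
			break;
		}
		count++;
		off = align_up(off + RX_DESC_SIZE + pkt_len);
	}
	return count;
}
```

The truncated flag is the interesting part for this thread: it separates "buffer ended cleanly" from "descriptor claims more than the URB delivered", which is the condition reported in the logs.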

Best Regards,
Hayes

2016-11-18 12:03:36

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-18 02:57 AM, Hayes Wang wrote:
..
> Besides, the maximum data length which the RTL8152 would send to
> the host is 16KB. That is, if the agg_buf_sz is 16KB, the host
> wouldn't split it. However, you still see problems for it.

How does the RTL8152 know that the limit is 16KB,
rather than some other number? Is this a hardwired number
in the hardware, or is it a parameter that the software
sends to the chip during initialization?

I have a USB analyzer, but it is difficult to figure out how
to program an appropriate trigger point for the capture,
since the problem (with 16KB URBs) takes minutes to hours
or even days to trigger.

And the output from the analyzer is in some proprietary format.
The in-kernel software analyzer could be useful, but I have never
figured out how to use it. :)

Since my earlier email, I have figured out another piece of the
puzzle with this dongle.

The first issue is that a packet sometimes begins in one URB,
and completes in the next URB, without an rx_desc at the start
of the second URB. This I have already reported earlier.

But the driver, as written, sometimes accesses bytes outside
of the 16KB URB buffer, because it trusts the non-existent
rx_desc in these cases, and also because it accesses bytes
from the rx_desc without first checking whether there is
sufficient remaining space in the URB to hold an rx_desc.

These incorrect accesses sometimes touch memory outside
of the URB buffer. Since the driver allocates all of its
rx URB buffers at once, they are highly likely to be
physically (and therefore virtually) adjacent in memory.

So mistakenly accessing beyond the end of one buffer will
often result in a read from memory of the next URB buffer.
Which causes a portion of it to be loaded into the D-cache.

When that URB is subsequently filled by DMA, there then exists
a data-consistency issue: the D-cache contains stale information
from before the latest DMA cycle.

So this explains the strange memory behaviour observed earlier on.
When I add a call to invalidate_dcache_range() to the driver
just before it begins examining a new rx URB, the problems go away.
So this confirms the observations.

Using non-cacheable RAM also makes the problem go away.
But neither is a fix for the real buffer overrun accesses in the driver.

Fix the "packet spans URBs" bug, and fix the driver to ALWAYS
test lengths/ranges before accessing the actual buffer,
and everything should begin working reliably.

Cheers
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-22 13:12:38

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-18 07:03 AM, Mark Lord wrote:
> On 16-11-18 02:57 AM, Hayes Wang wrote:
> ..
>> Besides, the maximum data length which the RTL8152 would send to
>> the host is 16KB. That is, if the agg_buf_sz is 16KB, the host
>> wouldn't split it. However, you still see problems for it.
>
> How does the RTL8152 know that the limit is 16KB,
> rather than some other number? Is this a hardwired number
> in the hardware, or is it a parameter that the software
> sends to the chip during initialization?
..
> The first issue is that a packet sometimes begins in one URB,
> and completes in the next URB, without an rx_desc at the start
> of the second URB. This I have already reported earlier.

Long run tests over the weekend, with the invalidate_dcache_range() call
before the inner loop of r8152_rx_bottom(), turned up a few instances
where packets were truncated inside a 16384 byte URB buffer, without filling the URB.

[10.293228] r8152_rx_bottom: 4278 corrupted urb: head=9d210000 urb_offset=2856/3376 pkt_len(1518) exceeds remainder(496)
[10.304523] r8152_dump_rx_desc: 044805ee 40080000 006005dc 06020000 00000000 00000000 rx_len=1518
..
[ 16.660431] r8152_rx_bottom: 7802 corrupted urb: head=9d1f8000 urb_offset=1544/2064 pkt_len(1518) exceeds remainder(496)
[ 16.671719] r8152_dump_rx_desc: 044805ee 40480000 004005dc 46020006 00000000 00000000 rx_len=1518

The r8152.c driver attempted to build skb's for the entire packet size,
even though the 1518-byte packets had only 496-bytes of data in the URB.
It is not clear what the chip did with the rest of the packets in question,
but the next URBs in each case began with a new/real rx_desc and new packet.

There were also unconnected events during the test runs where the
test code noticed totally invalid rx_desc structs in the middles of URBs.
The stock driver would again have attempted to treat those as "valid" (ugh).

..
[ 10.273906] r8152_check_rx_desc: rx_desc looks bad.
[ 10.279012] r8152_rx_bottom: 4338 corrupted urb. head=9d210000 urb_offset=2856/3376 len_used=2880
[ 10.288196] r8152_dump_rx_desc: 312e3239 382e3836 0a20382e 3d435253 3034336d 202f3a30 rx_len=12857

..
[ 7.184565] r8152_check_rx_desc: rx_desc looks bad.
[ 7.189657] r8152_rx_bottom: 1678 corrupted urb. head=9d210000 urb_offset=2856/3376 len_used=2880
[ 7.198852] r8152_dump_rx_desc: a1388402 803c9001 84380810 a67c5c4c a77c782b c64c782b rx_len=1026
..
[ 10.351251] r8152_check_rx_desc: rx_desc looks bad.
[ 10.356356] r8152_rx_bottom: 4397 corrupted urb. head=9d20c000 urb_offset=4400/7984 len_used=4424
[ 10.365543] r8152_dump_rx_desc: 312e3239 382e3836 0a20382e 3d435253 3034336d 202f3a30 rx_len=12857
..
[ 10.518119] r8152_check_rx_desc: rx_desc looks bad.
[ 10.523204] r8152_rx_bottom: 4458 corrupted urb. head=9d210000 urb_offset=4400/7984 len_used=4424
[ 10.532416] r8152_dump_rx_desc: 54544120 6e3d5352 636f6c6f 65762c6b 343d7372 6464612c rx_len=16672
..
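
Several of the rejected descriptors above look like packet payload rather than descriptors. Assuming the dump prints each word after le32_to_cpu(), so that the least significant byte is the first byte in memory, the words can be turned back into text with a small userspace helper (words_to_ascii() is a made-up name for this sketch):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert n dumped 32-bit words back into the byte sequence they
 * were read from (least significant byte first) and store it in
 * out as a NUL-terminated string. Non-printable bytes become '.'.
 * out must have room for 4 * n + 1 characters.
 */
static void words_to_ascii(const uint32_t *words, size_t n, char *out)
{
	size_t i, b;

	for (i = 0; i < n; i++) {
		for (b = 0; b < 4; b++) {
			unsigned char c = (words[i] >> (8 * b)) & 0xff;

			*out++ = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
		}
	}
	*out = '\0';
}
```

For the last dump above (rx_len=16672) this yields " ATTRS=nolock,vers=4,add", which reads like ordinary text from inside a packet. That would fit a descriptor pointer that has wandered into packet data, though the dump alone does not show how it got there.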

> But the driver, as written, sometimes accesses bytes outside
> of the 16KB URB buffer, because it trusts the non-existent
> rx_desc in these cases, and also because it accesses bytes
> from the rx_desc without first checking whether there is
> sufficient remaining space in the URB to hold an rx_desc.
>
> These incorrect accesses sometimes touch memory outside
> of the URB buffer. Since the driver allocates all of its
> rx URB buffers at once, they are highly likely to be
> physically (and therefore virtually) adjacent in memory.
>
> So mistakenly accessing beyond the end of one buffer will
> often result in a read from memory of the next URB buffer.
> Which causes a portion of it to be loaded into the D-cache.
>
> When that URB is subsequently filled by DMA, there then exists
> a data-consistency issue: the D-cache contains stale information
> from before the latest DMA cycle.
>
> So this explains the strange memory behaviour observed earlier on.
> When I add a call to invalidate_dcache_range() to the driver
> just before it begins examining a new rx URB, the problems go away.
> So this confirms the observations.
>
> Using non-cacheable RAM also makes the problem go away.
> But neither is a fix for the real buffer overrun accesses in the driver.
>
> Fix the "packet spans URBs" bug, and fix the driver to ALWAYS
> test lengths/ranges before accessing the actual buffer,
> and everything should begin working reliably.

2016-11-23 03:52:42

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Friday, November 18, 2016 8:03 PM
[..]
> How does the RTL8152 know that the limit is 16KB,
> rather than some other number? Is this a hardwired number
> in the hardware, or is it a parameter that the software
> sends to the chip during initialization?

It is the limitation of the hardware.

> I have a USB analyzer, but it is difficult to figure out how
> to program an appropriate trigger point for the capture,
> since the problem (with 16KB URBs) takes minutes to hours
> or even days to trigger.

That is good. Our hw engineers really want it. Maybe you could send
a specific packet and trigger on it. You could allocate an skb,
fill it with whatever data you prefer, and call

skb_queue_tail(&tp->tx_queue, skb);

[...]
> The first issue is that a packet sometimes begins in one URB,
> and completes in the next URB, without an rx_desc at the start
> of the second URB. This I have already reported earlier.

However, our hw engineer says it wouldn't happen. Our hw always
sends rx_desc + packet + padding. The hw wouldn't split it into
two or more transmissions. That is why I wonder who does it.

> But the driver, as written, sometimes accesses bytes outside
> of the 16KB URB buffer, because it trusts the non-existent
> rx_desc in these cases, and also because it accesses bytes
> from the rx_desc without first checking whether there is
> sufficient remaining space in the URB to hold an rx_desc.

I think I check them. According to the following code,

	list_for_each_safe(cursor, next, &rx_queue) {
		struct rx_desc *rx_desc;
		struct rx_agg *agg;
		int len_used = 0;
		struct urb *urb;
		u8 *rx_data;

		...

		rx_desc = agg->head;
		rx_data = agg->head;
		len_used += sizeof(struct rx_desc); //<-- add the size of next rx_desc

		while (urb->actual_length > len_used) {
			struct net_device *netdev = tp->netdev;
			struct net_device_stats *stats = &netdev->stats;
			unsigned int pkt_len;
			struct sk_buff *skb;

			pkt_len = le32_to_cpu(rx_desc->opts1) & RX_LEN_MASK;
			if (pkt_len < ETH_ZLEN)
				break;

			len_used += pkt_len;
			if (urb->actual_length < len_used)
				break;

			pkt_len -= CRC_SIZE;
			rx_data += sizeof(struct rx_desc);

			...

find_next_rx:
			rx_data = rx_agg_align(rx_data + pkt_len + CRC_SIZE);
			rx_desc = (struct rx_desc *)rx_data;
			len_used = (int)(rx_data - (u8 *)agg->head);
			len_used += sizeof(struct rx_desc); //<-- add the size of next rx_desc
		}

submit:
		...
	}

The while loop checks whether the next rx_desc is inside the urb
buffer, because len_used includes the size of the next rx_desc.
Then, inside the loop, the packet size is added to len_used and the
result is checked against urb->actual_length again. Together these
make sure the rx_desc and the packet are inside the urb buffer, unless
urb->actual_length is more than agg_buf_sz. However, I don't think
that would happen.
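
The two checks described above can be tried outside the kernel. What follows is a minimal user-space sketch, not the driver's code: the helper name count_valid_packets, the 24-byte descriptor size, the 0x7fff length mask and the 8-byte alignment are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE 24u     /* assumed: six 32-bit words per rx descriptor */
#define LEN_MASK  0x7fffu /* assumed value of RX_LEN_MASK */
#define MIN_FRAME 60u     /* ETH_ZLEN */
#define ALIGN_TO  8u      /* assumed rx aggregation alignment */

/* Read a little-endian 32-bit word regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Count the packets that lie entirely inside buf[0..actual_len).
 * Hypothetical helper that mirrors the two length checks in the loop.
 */
static int count_valid_packets(const uint8_t *buf, size_t actual_len)
{
	size_t off = 0; /* offset of the current descriptor */
	int count = 0;

	/* Check 1: the whole descriptor must fit before it is read. */
	while (off + DESC_SIZE < actual_len) {
		size_t pkt_len = get_le32(buf + off) & LEN_MASK;

		if (pkt_len < MIN_FRAME)
			break;

		/* Check 2: descriptor plus packet must fit as well. */
		if (off + DESC_SIZE + pkt_len > actual_len)
			break;

		count++;

		/* The next descriptor starts at the next aligned offset. */
		off += DESC_SIZE + pkt_len;
		off = (off + ALIGN_TO - 1) & ~(size_t)(ALIGN_TO - 1);
	}
	return count;
}
```

Note that both checks only compare against the filled length. They cannot tell a genuine descriptor from arbitrary bytes that happen to carry a plausible length.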

Best Regards,
Hayes


2016-11-23 13:41:34

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

What does this code do:

>static void r8153_set_rx_early_size(struct r8152 *tp)
>{
> u32 mtu = tp->netdev->mtu;
> u32 ocp_data = (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
>
> ocp_write_word(tp, MCU_TYPE_USB, USB_RX_EARLY_SIZE, ocp_data);
>}

How is ocp_data used by the hardware?
Shouldn't the calculation also include sizeof(rx_desc) in there somewhere?

Thanks
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-23 15:12:40

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [[email protected]]
[...]
> What does this code do:

> >static void r8153_set_rx_early_size(struct r8152 *tp)
> >{
> > u32 mtu = tp->netdev->mtu;
> > u32 ocp_data = (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
> >
> > ocp_write_word(tp, MCU_TYPE_USB, USB_RX_EARLY_SIZE, ocp_data);
> >}

This only works for RTL8153. However, what you use is RTL8152.
It is like delay completion. It is used to reduce the loading of CPU
by letting a transfer contain more data to reduce the number of
transfers.

> How is ocp_data used by the hardware?
> Shouldn't the calculation also include sizeof(rx_desc) in there somewhere?

The algorithm is from our hw engineers, and it should be

(agg_buf_sz - packet size) / 8

You could refer to commit a59e6d815226 ("r8152: correct the rx early size").

2016-11-23 19:29:50

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-23 10:12 AM, Hayes Wang wrote:
> Mark Lord [[email protected]]
> [...]
>> What does this code do:
>
>>> static void r8153_set_rx_early_size(struct r8152 *tp)
>>> {
>>> u32 mtu = tp->netdev->mtu;
>>> u32 ocp_data = (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
>>>
>>> ocp_write_word(tp, MCU_TYPE_USB, USB_RX_EARLY_SIZE, ocp_data);
>>> }
>
> This only works for RTL8153. However, what you use is RTL8152.
> It is like delay completion. It is used to reduce the loading of CPU
> by letting a transfer contain more data to reduce the number of
> transfers.
>
>> How is ocp_data used by the hardware?
>> Shouldn't the calculation also include sizeof(rx_desc) in there somewhere?
>
> The algorithm is from our hw engineers, and it should be
>
> (agg_buf_sz - packet size) / 8
>
> You could refer to commit a59e6d815226 ("r8152: correct the rx early size").

Thanks.

Right now I am working quite hard trying to narrow things down exactly.
You are correct that the driver does appear to be careful about accesses
beyond the filled portion of a URB buffer -- for some reason I thought
the original driver had issues there, but looking again it does not seem to.

One idea that is now looking more likely:
Things could be suffering from speculative CPU accesses to RAM
(the system here has non-coherent d-cache/RAM).
This could incorrectly pre-load data from adjacent URB buffers
into the d-cache, creating coherency issues. I am testing now
with cacheline-sized guard zones between the buffers to see if
that is the issue or not.

Worth repeating: other dongles we have tried, eg. those using the asix driver,
do not cause us any troubles here. Only the r8152 dongles do.

The other drivers do not use hardware checksums, so even if they did
incur similar bad packets, whatever the reason, those bad packets
would be detected/rejected by the Linux network stack (software checksums).
So everything appears to behave fine with them, as it does with
the r8152 driver when hardware checksums are disabled.

Still trying to understand exactly how these errors are happening.
It takes a very long time to do a conclusive test of anything here,
and I only have the hardware for a day or two a week.
So my apologies if I am slow in getting back to you on stuff.

Cheers



2016-11-24 03:25:04

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Thursday, November 24, 2016 3:30 AM
[...]
> Worth repeating: other dongles we have tried, eg. those using the asix driver,
> do not cause us any troubles here. Only the r8152 dongles do.

I can't tell you why you see the problem. I have tested the
RTL8152 on a Raspberry Pi platform with iperf for more than 17 hours,
and I haven't seen any invalid rx descriptor. I don't think it really
is an issue with our hw.

Best Regards,
Hayes

2016-11-24 12:31:29

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-23 02:29 PM, Mark Lord wrote:
> On 16-11-23 10:12 AM, Hayes Wang wrote:
>> Mark Lord [[email protected]]
>> [...]
>>> What does this code do:
>>
>>>> static void r8153_set_rx_early_size(struct r8152 *tp)
>>>> {
>>>> u32 mtu = tp->netdev->mtu;
>>>> u32 ocp_data = (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
>>>>
>>>> ocp_write_word(tp, MCU_TYPE_USB, USB_RX_EARLY_SIZE, ocp_data);
>>>> }
>>
>> This only works for RTL8153. However, what you use is RTL8152.
>> It is like delay completion. It is used to reduce the loading of CPU
>> by letting a transfer contain more data to reduce the number of
>> transfers.
>>
>>> How is ocp_data used by the hardware?
>>> Shouldn't the calculation also include sizeof(rx_desc) in there somewhere?
>>
>> The algorithm is from our hw engineers, and it should be
>>
>> (agg_buf_sz - packet size) / 8
>>
>> You could refer to commit a59e6d815226 ("r8152: correct the rx early size").
>
> Thanks.
>
> Right now I am working quite hard trying to narrow things down exactly.
> You are correct that the driver does appear to be careful about accesses
> beyond the filled portion of a URB buffer -- for some reason I thought
> the original driver had issues there, but looking again it does not seem to.
>
> One idea that is now looking more likely:
> Things could be suffering from speculative CPU accesses to RAM
> (the system here has non-coherent d-cache/RAM).
> This could incorrectly pre-load data from adjacent URB buffers
> into the d-cache, creating coherency issues. I am testing now
> with cacheline-sized guard zones between the buffers to see if
> that is the issue or not.

Nope. Guard zones did not fix it, so it's probably not a prefetch issue.
Oddly, adding a couple of memory barriers to specific places in the driver
does help, A LOT. Still not 100%, but it did pass 1800 reboot tests over night
with only three bad rx_desc's reported.

That's a new record here for the driver using kmalloc'd buffers,
and puts reliability on par with using non-cacheable buffers.

Any way we look at it though, the chip/driver are simply unreliable,
and relying upon hardware checksums (which fail due to the driver
looking at garbage rather than the checksum bits) leads to data corruption.

Cheers



2016-11-24 12:37:40

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Wednesday, November 23, 2016 9:41 PM
[...]
> >static void r8153_set_rx_early_size(struct r8152 *tp)
> >{
> > u32 mtu = tp->netdev->mtu;
> > u32 ocp_data = (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
> >
> > ocp_write_word(tp, MCU_TYPE_USB, USB_RX_EARLY_SIZE, ocp_data);
> >}
>
> How is ocp_data used by the hardware?
> Shouldn't the calculation also include sizeof(rx_desc) in there somewhere?

I checked your question with our hw engineers, and you are right.
The size of the rx descriptor should be included in the calculation, too.
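
To make the difference concrete, here is a small sketch comparing the calculation as quoted with a variant that also reserves room for one descriptor. The 24-byte descriptor size and the name early_size_with_desc are assumptions. The divisor of 4 is taken from the quoted code; the thread also mentions a divisor of 8, and this sketch does not settle which one the hardware expects.

```c
#include <stdint.h>

#define VLAN_ETH_HLEN 18u /* Ethernet header plus one VLAN tag */
#define VLAN_HLEN      4u /* one additional VLAN tag */
#define RX_DESC_SIZE  24u /* assumed: six 32-bit descriptor words */

/* Calculation as quoted from the driver: the descriptor is not counted. */
static uint32_t early_size_quoted(uint32_t agg_buf_sz, uint32_t mtu)
{
	return (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN) / 4;
}

/* Hypothetical variant that also reserves room for one rx descriptor. */
static uint32_t early_size_with_desc(uint32_t agg_buf_sz, uint32_t mtu)
{
	return (agg_buf_sz - mtu - VLAN_ETH_HLEN - VLAN_HLEN -
		RX_DESC_SIZE) / 4;
}
```

With a 16384-byte buffer and an MTU of 1500 the two versions give 3715 and 3709, so the register value shrinks by sizeof(rx_desc) / 4.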

Best Regards,
Hayes



2016-11-24 13:27:07

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Thursday, November 24, 2016 8:31 PM
[...]
> Nope. Guard zones did not fix it, so it's probably not a prefetch issue.
> Oddly, adding a couple of memory barriers to specific places in the driver
> does help, A LOT. Still not 100%, but it did pass 1800 reboot tests over night
> with only three bad rx_desc's reported.
>
> That's a new record here for the driver using kmalloc'd buffers,
> and puts reliability on par with using non-cacheable buffers.
>
> Any way we look at it though, the chip/driver are simply unreliable,
> and relying upon hardware checksums (which fail due to the driver
> looking at garbage rather than the checksum bits) leads to data corruption.

I don't think the garbage results from our driver or device.
If it is a memory issue, I think the host driver ought
to deal with it, because it handles the DMA.

Besides, it doesn't seem to occur for all platforms. I have
tested the iperf more than 26 hours, and it still works fine.
I think I would get the same result on x86 or x86_64 platform.

Best Regards,
Hayes

2016-11-24 15:25:07

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 08:26 AM, Hayes Wang wrote:
..
> Besides, it doesn't seem to occur for all platforms. I have
> tested the iperf more than 26 hours, and it still works fine.
> I think I would get the same result on x86 or x86_64 platform.
..

x86 has near fully-coherent memory, so it is the "easy" platform
to get things working on. But Linux supports a very diverse number
of platforms, with varying degrees of cache/memory coherency,
and it can be tricky for things to work correctly on all of them.

If you are testing with the driver as currently in 4.4.34,
then you won't even notice when things are screwing up,
because the driver just silently drops packets.
Or it passes them on without noticing that they have bad data.

Here (attached) is the instrumented driver I am using here now.
I suggest you use it or something similar when testing,
and not the stock driver.

This one has also been converted to use non-cacheable RAM for the
receive buffers -- something that is probably a Good Thing
for it to do regardless of this investigation.

It also never drops a packet without logging the event,
so we can see just how often there's an issue.

This version behaves almost perfectly here, but I am still experimenting
to see what is actually necessary, and what is not. In particular,
there are some mb() calls I had put in there that shouldn't be required,
so I have yet to try removing them again and see what changes.

It takes at least an overnight run to pop up one or two errors,
so do not expect to hear back again until after the weekend at this point.

Also, unrelated, but inside r8152_submit_rx() there is this code:

	/* The rx would be stopped, so skip submitting */
	if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
	    !test_bit(WORK_ENABLE, &tp->flags) || !netif_carrier_ok(tp->netdev))
		return 0;

If that "return 0" statement is ever executed, doesn't it result
in the loss/leak of a buffer?

Thanks



Attachments:
r8152.c (100.17 kB)

2016-11-24 16:22:06

by David Miller

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

From: Hayes Wang <[email protected]>
Date: Thu, 24 Nov 2016 13:26:55 +0000

> I don't think the garbage results from our driver or device.

This is my impression with what has been presented so far as well.

2016-11-24 16:27:06

by David Miller

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

From: Mark Lord <[email protected]>
Date: Thu, 24 Nov 2016 07:31:17 -0500

> Any way we look at it though, the chip/driver are simply unreliable,
> and relying upon hardware checksums (which fail due to the driver
> looking at garbage rather than the checksum bits) leads to data
> corruption.

If the cpu/DMA implementation is the problem, then turning off
checksums is not an appropriate fix at all.

In fact, we have no idea what the cause is yet.

That makes turning off random features no more than grasping at straws
and makes no sense at all upstream.

It may make sense for you to do such a change locally in _your_ tree
to fix your situation temporarily. But upstream we shouldn't be doing
it.

2016-11-24 16:44:06

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 11:21 AM, David Miller wrote:
> From: Hayes Wang <[email protected]>
> Date: Thu, 24 Nov 2016 13:26:55 +0000
>
>> I don't think the garbage results from our driver or device.
> This is my impression with what has been presented so far as well.

It's not garbage.

The latest run with the debug code I posted here earlier just spat out this below.
Using coherent (guarded, non-cacheable) RX buffers, with mb() calls:

[ 15.199157] r8152_check_rx_desc: rx_desc looks bad.
[ 15.204270] r8152_rx_bottom: offset=0/3376 bad rx_desc
[ 15.209584] r8152_dump_rx_desc: 3d435253 3034336d 202f3a30 47524154 2f3d5445 3034336d rx_len=21075

The bad data in this case is ASCII:

"SRC=m3400:/ TARGET=/m340"

This data is what is seen in /run/mount/utab, a file that is read/written over NFS on each boot.

"SRC=m3400:/ TARGET=/m3400 ROOT=/ ATTRS=nolock,addr=192.168.8.1\n"

But how does this ASCII data end up at offset zero of the rx buffer??
Not possible -- this isn't even stale data, because only an rx_desc could
be at that offset in that buffer.

So even if this were a platform memory coherency issue, one should still
never see ASCII data at the beginning of an rx buffer. The driver NEVER
writes anything to the rx buffers. Only the USB hardware ever does.

And only the r8152 dongle/driver exhibits this issue.
Other USB dongles do not. They *might* still have such issues,
but because they use software checksums, the bad packets are caught/rejected.

The r8152 driver, without the debug/error-checking additions, would have tried
to interpret that ASCII data as an "rx_desc", and would have interpreted the
"checksum bits" therein as "valid checksum", and the packet would have passed
through the network stack, corrupting data.
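
The dump quoted above can be checked by hand. This is a minimal sketch, assuming the six words were printed after le32_to_cpu() and that RX_LEN_MASK is 0x7fff; the helper names are hypothetical. It unpacks the words into bytes and extracts the length field the driver would read from the first word.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_LEN_MASK 0x7fffu /* assumed to match the driver's mask */

/*
 * Unpack n little-endian 32-bit words into out, which must hold
 * n * 4 bytes plus a terminating NUL. Hypothetical helper.
 */
static void words_to_ascii(const uint32_t *words, size_t n, char *out)
{
	for (size_t i = 0; i < n; i++) {
		for (int b = 0; b < 4; b++)
			out[i * 4 + b] = (char)((words[i] >> (8 * b)) & 0xff);
	}
	out[n * 4] = '\0';
}

/* Length field as it would be extracted from the first descriptor word. */
static unsigned int desc_len(uint32_t opts1)
{
	return opts1 & RX_LEN_MASK;
}
```

Feeding it the six words from the log yields the text "SRC=m3400:/ TARGET=/m340", and the first word 0x3d435253 masks down to 0x5253 = 21075, which matches the rx_len in the log line.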

This driver worked without noticeable issues in 3.12.xx.
It hasn't worked since. Because it now trusts the hardware checksums,
without first checking to see if noise-on-the-line or something else
has corrupted the data before receipt in the rx buffer.

Based on the above capture, I suspect a bug in the chip itself, which perhaps
is only manifest on a very slow CPU.

Nobody here tests with slow CPUs, but they are very prevalent in embedded space.
And very few people use USB network dongles nowadays either, as nearly all "computers"
have built-in networking. The market for USB network dongles is mostly embedded space.

Ergo.

Cheers

2016-11-24 17:00:26

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 11:43 AM, Mark Lord wrote:
..
> But how does this ASCII data end up at offset zero of the rx buffer??
> Not possible -- this isn't even stale data, because only an rx_desc could
> be at that offset in that buffer.

Answering my own question here, I suspect it ends up there as a result
of overrunning the previous URB. So I have updated the test copy of the driver
here now to check for that exact situation. It's running now, but could take
hours or a day for the bug to occur again.

It seems I am being overly helpful here.

Perhaps I should have just stopped with the original regression report
(driver works in 3.12.xx, fails on all newer kernels, as a result of enabling
hardware checksums).

Had I left it there, one might reasonably expect the onus to be on the driver
developer to sort it out, with me providing retests of supplied patches as need be.

But I've gone WAY BEYOND that, even questioning the sanity of the platform on
which it is being used, just to avoid blaming a buggy USB dongle for some other issue.
And this is leading people to suspect that I really think the platform is buggy.

It isn't. It's been running for years, with a variety of USB hardware attached,
and nary a problem. Except with this r8152 dongle on kernels > 3.12.

So, yeah, the driver is fixed in our local tree, and has been for some time now.
I just was hoping that perhaps others might be interested in it too,
since the bug (whatever it is) corrupts data on the NFS server.

Cheers


2016-11-24 17:12:07

by David Miller

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

From: Mark Lord <[email protected]>
Date: Thu, 24 Nov 2016 11:43:53 -0500

> So even if this were a platform memory coherency issue, one should
> still never see ASCII data at the beginning of an rx buffer.

I'm not so convinced, since this is the kind of random corruption one
would expect to see when dealing with virtual caches that have
aliasing or similar issues.

Writes to address X that show up at address Y or not at all are
precisely the signature of virtual cache aliasing problems.

Is it a case of the chip writing to X but the cpu is still seeing
stale data from a previous CPU store?

For NFS the cpu is writing into the page cache, so we know that
cpu side stores are where the ASCII text is coming from.

Now is the r8152 buffer one that the USB host controller is DMA'ing
into directly, or is it one that SWIOMMU or similar bounce buffering
is copying into? In the latter case we are doing cpu stores into
the area and the writes aren't coming from the device.

2016-11-24 17:14:06

by David Miller

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

From: Mark Lord <[email protected]>
Date: Thu, 24 Nov 2016 12:00:15 -0500

> It seems I am being overly helpful here.

Either you want to cry or you want to keep helping us track down
this problem. It is your choice, and your choice alone.

Please do not pretend otherwise, everyone else in this thread is
operating with the best intentions and wants to see this through
to a full analysis and a proper solution for the corruptions.

Thank you.

2016-11-24 18:34:13

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 12:11 PM, David Miller wrote:
> From: Mark Lord <[email protected]>
> Date: Thu, 24 Nov 2016 11:43:53 -0500
>
>> So even if this were a platform memory coherency issue, one should
>> still never see ASCII data at the beginning of an rx buffer.
>
> I'm not so convinced, since this is the kind of random corruption one
> would expect to see when dealing with virtual caches that have
> aliasing or similar issues.
>
> Writes to address X that show up at address Y or not at all are
> precisely the signature of virtual cache aliasing problems.
>
> Is it a case of the chip writing to X but the cpu is still seeing
> stale data from a previous CPU store?
>
> For NFS the cpu is writing into the page cache, so we know that
> cpu side stores are where the ASCII text is coming from.
>
> Now is the r8152 buffer one that the USB host controller is DMA'ing
> into directly, or is it one that SWIOMMU or similar bounce buffering
> is copying into? In the latter case we are doing cpu stores into
> the area and the writes aren't coming from the device.

From tracing through the powerpc arch code, this is the buffer that
is being directly DMA'd into. And the USB layer does an invalidate_dcache
on that entire buffer before initiating the DMA (confirmed via printk).

The driver itself NEVER writes anything to that buffer,
and nobody else has a pointer to it other than the USB host controller,
so there's nothing else that can write to it either.

According to the driver writer, the chip should only ever write a fresh
rx_desc struct at the beginning of a buffer, never ASCII data.

So how does that buffer end up containing ASCII data from the NFS transfers?

The only explanation I can see, is if the URB itself contains
the data that we see in the URB buffer. Which is what one would expect.
So for that to happen, the ethernet chip must be transferring that data.

The thing that is special about the situation here, is that the processor
is very slow (800MHz 32-bit powerpc), and very busy elsewhere.
So it can easily fall way behind in servicing the ethernet dongle,
something that never happens with most modern faster machines.
So perhaps this results in a FIFO overflow somewhere in the chip.

We can boot/run this same machine from a USB memory stick, and nary a problem.
Ditto for other types of ethernet dongles.
But boot/run from that specific ethernet dongle, and we get regular
random segfaults from corrupted page fetches over NFS.

The only end-to-end data integrity available here is the rx checksum,
when verified by software rather than trusting it to the chip/driver.
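
For reference, the software check in question is the one's-complement Internet checksum. Below is a minimal user-space sketch following RFC 1071; it is not the kernel's implementation, and the function name is chosen here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement Internet checksum (RFC 1071) over len bytes.
 * The 32-bit accumulator is ample for packet-sized inputs.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}
	if (len) /* odd trailing byte, padded with a zero byte */
		sum += (uint32_t)data[0] << 8;

	while (sum >> 16) /* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A receiver that runs this over data with the checksum appended gets zero for an intact buffer. A corrupted byte generally gives a non-zero result, which is how the stack would reject a packet that a bogus descriptor had marked as good.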

One thought: bulk data streams are byte streams, not packets.
Scheduling on the USB bus can break up larger transfers across
multiple in-kernel buffers. A "real" URB buffer on USB2 is max 512 bytes.
The driver is providing 16384-byte buffers, and assumes that data will
never spill over from one such buffer to the next.
Yet the observations here consistently show otherwise.

Cheers
--
Mark Lord

2016-11-24 18:42:47

by Greg Kroah-Hartman

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Thu, Nov 24, 2016 at 11:43:53AM -0500, Mark Lord wrote:
> On 16-11-24 11:21 AM, David Miller wrote:
> > From: Hayes Wang <[email protected]>
> > Date: Thu, 24 Nov 2016 13:26:55 +0000
> >
> > > I don't think the garbage results from our driver or device.
> > This is my impression with what has been presented so far as well.
>
> It's not garbage.
>
> The latest run with the debug code I posted here earlier just spat out this below.
> Using coherent (guarded, non-cacheable) RX buffers, with mb() calls:
>
> [ 15.199157] r8152_check_rx_desc: rx_desc looks bad.
> [ 15.204270] r8152_rx_bottom: offset=0/3376 bad rx_desc
> [ 15.209584] r8152_dump_rx_desc: 3d435253 3034336d 202f3a30 47524154 2f3d5445 3034336d rx_len=21075
>
> The bad data in this case is ASCII:
>
> "SRC=m3400:/ TARGET=/m340"

Have you tried using usbmon? Details on how to use it are in
Documentation/usbmon.txt, and it might help you tell a driver issue
from a USB host controller issue, as it sees the raw data the USB host
controller sees before it sends it to the driver.

thanks,

greg k-h

2016-11-24 18:49:53

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 01:34 PM, Mark Lord wrote:
> From tracing through the powerpc arch code, this is the buffer that
> is being directly DMA'd into. And the USB layer does an invalidate_dcache
> on that entire buffer before initiating the DMA (confirmed via printk).

Slight correction: the invalidate_dcache_range() is only done when
using kmalloc'd buffers. I have converted the driver here
to use usb_alloc_coherent() instead, so that now gets skipped
since the memory is never cached.

2016-11-24 18:58:39

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 01:42 PM, Greg KH wrote:
>
> Have you tried using usbmon?

This system is running rootfs over NFS, so usbmon
isn't realistically going to be usable in that scenario
without a lot of reconfiguration of the setup (which in itself
might obscure the original problem).

There is a hardware USB analyzer in the building though.

But it requires a MS-Windows machine (very scarce here, I don't have one)
for the incredibly user-unfriendly software. I'm not sure if it can be
setup to stop the trace somehow at the right point either, as it takes
overnight runs usually to catch an occurrence of the issue.

I also seem to recall that it only exports data captures in a proprietary
format that only that brand of software/device can read, but perhaps
that might not be true. Would still need to find a MS-Windows machine/license
to even check it out though.

2016-11-24 19:00:49

by Greg KH

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Thu, Nov 24, 2016 at 01:34:08PM -0500, Mark Lord wrote:
> One thought: bulk data streams are byte streams, not packets.
> Scheduling on the USB bus can break up larger transfers across
> multiple in-kernel buffers. A "real" URB buffer on USB2 is max 512 bytes.
> The driver is providing 16384-byte buffers, and assumes that data will
> never spill over from one such buffer to the next.
> Yet the observations here consistently show otherwise.

Wait, how do you know that data will not spill over? What is making
that guarantee? Will the USB device send a "zero packet" in order to
show that all of the "logical" data is now sent for this specific
endpoint? Is there some sort of "framing" that the device does with the
USB data so that the driver "knows" where the end of packet is?

Check the zero-packet stuff for this device, that's tripped up many a
USB driver writer over the years, myself included.
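
As a reminder of the rule behind the zero-packet question: a bulk transfer ends either when the host's buffer is full or when the device sends a packet shorter than the endpoint's maximum packet size. A minimal sketch of that rule follows, with a hypothetical helper name; whether this device actually sends zero-length packets is not established in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a zero-length packet is needed to mark the end of a
 * bulk transfer of transfer_len bytes. A payload that is an exact
 * multiple of max_packet ends on a full-sized packet, so the host
 * cannot tell that the transfer is over unless its buffer is already
 * full. Requires max_packet > 0.
 */
static bool needs_zero_length_packet(size_t transfer_len, size_t max_packet,
				     size_t host_buf_len)
{
	return transfer_len < host_buf_len &&
	       transfer_len % max_packet == 0;
}
```

With 512-byte high-speed bulk packets and a 16384-byte buffer, a 1024-byte aggregate needs the extra packet while a 1000-byte one does not. Without it, the bytes that follow would land in the same URB as a continuation.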

thanks,

greg k-h

2016-11-24 19:10:42

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 02:00 PM, Greg KH wrote:
> On Thu, Nov 24, 2016 at 01:34:08PM -0500, Mark Lord wrote:
>> One thought: bulk data streams are byte streams, not packets.
>> Scheduling on the USB bus can break up larger transfers across
>> multiple in-kernel buffers. A "real" URB buffer on USB2 is max 512 bytes.
>> The driver is providing 16384-byte buffers, and assumes that data will
>> never spill over from one such buffer to the next.
>> Yet the observations here consistently show otherwise.
>
> Wait, how do you know that data will not spill over? What is making
> that guarantee? Will the USB device send a "zero packet" in order to
> show that all of the "logical" data is now sent for this specific
> endpoint? Is there some sort of "framing" that the device does with the
> USB data so that the driver "knows" where the end of packet is?

Exactly my point.

> Check the zero-packet stuff for this device, that's tripped up many a
> USB driver writer over the years, myself included.

I haven't tripped over it myself, but only because we were careful
to allow for such in the USB drivers I have worked on.

The r8152 driver just assumes it never happens.

2016-11-24 19:17:29

by Greg KH

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Thu, Nov 24, 2016 at 02:10:36PM -0500, Mark Lord wrote:
> On 16-11-24 02:00 PM, Greg KH wrote:
> > On Thu, Nov 24, 2016 at 01:34:08PM -0500, Mark Lord wrote:
> >> One thought: bulk data streams are byte streams, not packets.
> >> Scheduling on the USB bus can break up larger transfers across
> >> multiple in-kernel buffers. A "real" URB buffer on USB2 is max 512 bytes.
> >> The driver is providing 16384-byte buffers, and assumes that data will
> >> never spill over from one such buffer to the next.
> >> Yet the observations here consistently show otherwise.
> >
> > Wait, how do you know that data will not spill over? What is making
> > that guarantee? Will the USB device send a "zero packet" in order to
> > show that all of the "logical" data is now sent for this specific
> > endpoint? Is there some sort of "framing" that the device does with the
> > USB data so that the driver "knows" where the end of packet is?
>
> Exactly my point.
>
> > Check the zero-packet stuff for this device, that's tripped up many a
> > USB driver writer over the years, myself included.
>
> I haven't tripped over it myself, but only because we were careful
> to allow for such in the USB drivers I have worked on.
>
> The r8152 driver just assumes it never happens.

Assumes what? That the host will always consume data faster than the
device can create it? If so, that sounds like your real problem
there...

good luck!

greg k-h

2016-11-25 00:27:41

by Francois Romieu

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord <[email protected]> :
[...]
> From tracing through the powerpc arch code, this is the buffer that
> is being directly DMA'd into. And the USB layer does an invalidate_dcache
> on that entire buffer before initiating the DMA (confirmed via printk).
>
> The driver itself NEVER writes anything to that buffer,
> and nobody else has a pointer to it other than the USB host controller,
> so there's nothing else that can write to it either.
>
> According to the driver writer, the chip should only ever write a fresh
> rx_desc struct at the beginning of a buffer, never ASCII data.
>
> So how does that buffer end up containing ASCII data from the NFS transfers?

Through aliasing the URB was given a page that contains said (previously)
received file. The ethernet chip/usb host does not write anything in it.
There could be a device or a driver problem but it may not be the real
problem.

So far the analysis focused on "how was this corrupted content written into
this receive buffer page ?". If I read David correctly (?) the "nobody
else has a pointer to it other than the USB host controller" point may be
replaced with "the pointer to it aliases some already used page".

--
Ueimor

2016-11-25 03:49:43

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-24 07:27 PM, Francois Romieu wrote:
>
> Through aliasing the URB was given a page that contains said (previously)
> received file. The ethernet chip/usb host does not write anything in it.

I don't see how that could be possible. Please elaborate.

The URB buffers are statically allocated by the driver at probe time,
ten of them in all, allocated with usb_alloc_coherent() in the copy of
the driver I am testing with.

There is no possibility for them to be used for anything other than
USB receive buffers, for this driver only. Nothing in the driver
or kernel ever writes to those buffers after initial allocation,
and only the driver and USB host controller ever have pointers to the buffers.
--
Mark Lord

2016-11-25 06:12:14

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Thursday, November 24, 2016 11:25 PM
[...]
> x86 has near fully-coherent memory, so it is the "easy" platform
> to get things working on. But Linux supports a very diverse number
> of platforms, with varying degrees of cache/memory coherency,
> and it can be tricky for things to work correctly on all of them.

However, I have tested iperf on the Raspberry Pi v1 that you suggested
for more than one day. I still couldn't reproduce your issue.

> If you are testing with the driver as currently in 4.4.34,
> then you won't even notice when things are screwing up,
> because the driver just silently drops packets.
> Or it passes them on without noticing that they have bad data.

I only drop the packet silently when the rx descriptor is outside
the urb buffer. Then, I check the rx descriptor before checking
the length of the packet.
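
For illustration, that order of checks can be sketched in plain C. This is a simplified user-space sketch, not the driver's code: the 24-byte descriptor size and both helper names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed descriptor size for this sketch; the driver would use
 * sizeof(struct rx_desc). */
#define DESC_SIZE 24u

/* True if a descriptor starting at `offset` lies entirely inside
 * the `urb_len` bytes the URB actually returned. */
bool rx_desc_fits(size_t offset, size_t urb_len)
{
	return offset <= urb_len && urb_len - offset >= DESC_SIZE;
}

/* True if the descriptor fits and the packet that follows it also
 * lies inside the URB data. The descriptor is checked first, so the
 * subtraction below cannot wrap. */
bool rx_packet_fits(size_t offset, size_t pkt_len, size_t urb_len)
{
	if (!rx_desc_fits(offset, urb_len))
		return false;
	return pkt_len <= urb_len - offset - DESC_SIZE;
}
```

Note that neither check says anything about whether the bytes inside the descriptor are sane; that is what the separate descriptor validation is for.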

> Here (attached) is the instrumented driver I am using here now.
> I suggest you use it or something similar when testing,
> and not the stock driver.

I will test it again with your driver.

[...]
> Also, unrelated, but inside r8152_submit_rx() there is this code:
>
> /* The rx would be stopped, so skip submitting */
> if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
> !test_bit(WORK_ENABLE, &tp->flags)
> || !netif_carrier_ok(tp->netdev))
> return 0;
>
> If that "return 0" statement is ever executed, doesn't it result
> in the loss/leak of a buffer?

They would be recovered by calling rtl_start_rx() when the rx
is restarted.

Best Regards,
Hayes

2016-11-25 06:32:23

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Friday, November 25, 2016 12:44 AM
[...]
> The bad data in this case is ASCII:
>
> "SRC=m3400:/ TARGET=/m340"
>
> This data is what is seen in /run/mount/utab, a file that is read/written over NFS on
> each boot.
>
> "SRC=m3400:/ TARGET=/m3400 ROOT=/
> ATTRS=nolock,addr=192.168.8.1\n"
>
> But how does this ASCII data end up at offset zero of the rx buffer??
> Not possible -- this isn't even stale data, because only an rx_desc could
> be at that offset in that buffer.
>
> So even if this were a platform memory coherency issue, one should still
> never see ASCII data at the beginning of an rx buffer. The driver NEVER
> writes anything to the rx buffers. Only the USB hardware ever does.
>
> And only the r8152 dongle/driver exhibits this issue.
> Other USB dongles do not. They *might* still have such issues,
> but because they use software checksums, the bad packets are caught/rejected.

Do you test it by rebooting? Maybe you could try a patch
commit 93fe9b183840 ("r8152: reset the bmu"). However, it should
only occur for the first urb buffer after rx is reset. I don't
think you would reset the rx frequently, so the situation seems
to be different.

Best Regards,
Hayes

2016-11-25 06:51:44

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

> Mark Lord [mailto:[email protected]]
> > Sent: Friday, November 25, 2016 12:44 AM
> [...]
> > The bad data in this case is ASCII:
> >
> > "SRC=m3400:/ TARGET=/m340"
> >
> > This data is what is seen in /run/mount/utab, a file that is read/written over NFS
> on
> > each boot.
> >
> > "SRC=m3400:/ TARGET=/m3400 ROOT=/
> > ATTRS=nolock,addr=192.168.8.1\n"
> >
> > But how does this ASCII data end up at offset zero of the rx buffer??
> > Not possible -- this isn't even stale data, because only an rx_desc could
> > be at that offset in that buffer.
> >
> > So even if this were a platform memory coherency issue, one should still
> > never see ASCII data at the beginning of an rx buffer. The driver NEVER
> > writes anything to the rx buffers. Only the USB hardware ever does.
> >
> > And only the r8152 dongle/driver exhibits this issue.
> > Other USB dongles do not. They *might* still have such issues,
> > but because they use software checksums, the bad packets are caught/rejected.
>
> Do you test it by rebooting? Maybe you could try a patch
> commit 93fe9b183840 ("r8152: reset the bmu"). However, it should
> only occur for the first urb buffer after rx is reset. I don't
> think you would reset the rx frequently, so the situation seems
> to be different.

Forgive me. I provided wrong information. This is about RTL8153,
not RTL8152.

Best Regards,
Hayes

2016-11-25 09:52:50

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord [mailto:[email protected]]
> Sent: Friday, November 25, 2016 3:11 AM
[...]
> On 16-11-24 02:00 PM, Greg KH wrote:
> > On Thu, Nov 24, 2016 at 01:34:08PM -0500, Mark Lord wrote:
> >> One thought: bulk data streams are byte streams, not packets.
> >> Scheduling on the USB bus can break up larger transfers across
> >> multiple in-kernel buffers. A "real" URB buffer on USB2 is max 512 bytes.
> >> The driver is providing 16384-byte buffers, and assumes that data will
> >> never spill over from one such buffer to the next.
> >> Yet the observations here consistently show otherwise.
> >
> > Wait, how do you know that data will not spill over? What is making
> > that guarantee? Will the USB device send a "zero packet" in order to
> > show that all of the "logical" data is now sent for this specific
> > endpoint? Is there some sort of "framing" that the device does with the
> > USB data so that the driver "knows" where the end of packet is?
>
> Exactly my point.
>
> > Check the zero-packet stuff for this device, that's tripped up many a
> > USB driver writer over the years, myself included.
>
> I haven't tripped over it myself, but only because we were careful
> to allow for such in the USB drivers I have worked on.
>
> The r8152 driver just assumes it never happens.

What is the value of /sys/bus/usb/devices/.../power/control ?
Could you make sure it is "on" and try again?
Or you could call usb_disable_autosuspend() in probe().

Best Regards,
Hayes
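
On the zero-length-packet question quoted in this message: a bulk IN transfer completes either when the host buffer is full or when a packet shorter than the endpoint's maximum packet size arrives. A payload that is an exact multiple of the maximum packet size, and smaller than the buffer, therefore needs a zero-length packet to mark its end. A minimal sketch of that rule follows; the helper name is hypothetical, and it says nothing about what the r8152 hardware actually does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the device must send a zero-length packet so the
 * host can tell that a bulk IN transfer of `payload_len` bytes has
 * ended before the `urb_buf_len`-byte buffer is full. */
bool bulk_in_needs_zlp(size_t payload_len, size_t max_packet,
		       size_t urb_buf_len)
{
	/* Buffer filled: the URB completes without a short packet. */
	if (max_packet == 0 || payload_len >= urb_buf_len)
		return false;
	/* The last packet is full-sized, so nothing marks the end. */
	return payload_len % max_packet == 0;
}
```

With the 512-byte high-speed packet size and 16384-byte buffers mentioned above, a 1024-byte payload would need the terminator and a 1000-byte payload would not.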



2016-11-25 09:53:55

by Greg KH

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Thu, Nov 24, 2016 at 10:49:33PM -0500, Mark Lord wrote:
> There is no possibility for them to be used for anything other than
> USB receive buffers, for this driver only. Nothing in the driver
> or kernel ever writes to those buffers after initial allocation,
> and only the driver and USB host controller ever have pointers to the buffers.

You really are going to have to break out that USB monitor to verify
that this is the data coming across the wire. Note, there are "cheap"
USB monitors that can be quite handy and that work on Linux:
http://www.totalphase.com/products/beagle-usb12/

Or most high-end scopes have a USB mode that you can use to catch stuff
like this (but they are usually harder to use/trigger and only store a
very limited buffer).

good luck!

greg k-h

2016-11-25 12:42:17

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 01:11 AM, Hayes Wang wrote:
> Mark Lord [mailto:[email protected]]
..
>> If that "return 0" statement is ever executed, doesn't it result
>> in the loss/leak of a buffer?
>
> They would be recovered by calling rtl_start_rx() when the rx
> is restarted.

Good. I figured it was probably something like that, but wasn't
entirely sure about the control flow around stop/restart there.

Thanks.

2016-11-25 12:42:38

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 01:51 AM, Hayes Wang wrote:
>
> Forgive me. I provided wrong information. This is about RTL8153, not RTL8152.

No problem. Thanks for trying though.

2016-11-25 12:42:47

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 07:34 AM, Mark Lord wrote:
> On 16-11-25 04:53 AM, Greg KH wrote:
>> Note, there are "cheap" USB monitors that can be quite handy and that work on Linux:
>> http://www.totalphase.com/products/beagle-usb12/
>
> USD$455/each in quantity, vs. USD$8 for the USB ethernet dongle.

Oh, wrong model. That one doesn't do USB2.
The USB2 version is a mere USD$1300 in quantity.

Seems like rather a lot of money just to report a bug in a USB driver.
Perhaps the Linux Foundation might purchase one and loan it for this task?

2016-11-25 12:47:45

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 04:53 AM, Greg KH wrote:
> Note, there are "cheap" USB monitors that can be quite handy and that work on Linux:
> http://www.totalphase.com/products/beagle-usb12/

USD$455/each in quantity, vs. USD$8 for the USB ethernet dongle.

2016-11-25 12:50:32

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 04:53 AM, Greg KH wrote:
> On Thu, Nov 24, 2016 at 10:49:33PM -0500, Mark Lord wrote:
>> There is no possibility for them to be used for anything other than
>> USB receive buffers, for this driver only. Nothing in the driver
>> or kernel ever writes to those buffers after initial allocation,
>> and only the driver and USB host controller ever have pointers to the buffers.
>
> You really are going to have to break out that USB monitor to verify
> that this is the data coming across the wire.

Not sure why, because there really is no other way for the data to
appear where it does at the beginning of that URB buffer.

This does seem a rather unexpected burden to place upon someone
reporting a regression in a USB network driver that corrupts user data.

I have already spent about 50 hours looking at this issue,
and everything now points firmly at some kind of FIFO overflow
within the dongle itself. There is no evidence to the contrary.

I am very happy to test any driver updates, or data collection mods
provided by the author, to help the author find/fix the issue.

One idea might be to have the author try testing with the dongle
connected through a USB1.1 hub, forcing it to slower speeds.
This might make reproducing the issue (if indeed a FIFO overflow)
easier, as the host transfers will then be slower than the
ethernet wire speed.

I have access to the hardware here next Tuesday.
If we can scrounge up the USB analyzer, cables, and a suitable
MS-Windows (ugh) machine of some kind, then I'll see if it can
be programmed to somehow capture the event. Probably just set it
in continuous capture mode, and have the target system halt
when it sees bad data at offset zero.
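
The "bad data at offset zero" trigger could be approximated by checking whether the first bytes of an rx buffer are all printable ASCII, which would be suspicious where a binary descriptor is expected. This is only a heuristic sketch; the helper name and the caller-chosen window are assumptions, not taken from the instrumented driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the first `window` bytes of `buf` are all printable
 * ASCII (0x20..0x7e). Buffers shorter than the window never match. */
bool looks_like_ascii(const unsigned char *buf, size_t len, size_t window)
{
	if (window == 0 || len < window)
		return false;
	for (size_t i = 0; i < window; i++) {
		if (buf[i] < 0x20 || buf[i] > 0x7e)
			return false;
	}
	return true;
}
```

A real descriptor can contain printable bytes by chance, so a short window would produce false positives; a longer window makes the check stricter.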

This can take days to reproduce, so don't hold your breaths.

Something useful to do in the meantime is to think
about "what next" after the analyzer confirms the issue.
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2016-11-25 13:32:35

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 04:52 AM, Hayes Wang wrote:
..
> What is the value of /sys/bus/usb/devices/.../power/control ?

That entry does not exist -- power control is completely
disabled on this board.

Good try, though -- USB power control still causes me trouble
on PCs with mice and remote controls. But not here.

2016-11-25 14:22:43

by Greg KH

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Fri, Nov 25, 2016 at 07:41:42AM -0500, Mark Lord wrote:
> On 16-11-25 07:34 AM, Mark Lord wrote:
> > On 16-11-25 04:53 AM, Greg KH wrote:
> >> Note, there are "cheap" USB monitors that can be quite handy and that work on Linux:
> >> http://www.totalphase.com/products/beagle-usb12/
> >
> > USD$455/each in quantity, vs. USD$8 for the USB ethernet dongle.
>
> Oh, wrong model. That one doesn't do USB2.
> The USB2 version is a mere USD$1300 in quantity.
>
> Seems like rather a lot of money just to report a bug in a USB driver.
> Perhaps the Linux Foundation might purchase one and loan it for this task?

You already have access to a USB analyzer you said, why would I try to
buy one and ship it around the world instead? Makes no sense...

greg k-h

2016-11-25 14:36:43

by Greg KH

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Fri, Nov 25, 2016 at 07:49:35AM -0500, Mark Lord wrote:
> On 16-11-25 04:53 AM, Greg KH wrote:
> > On Thu, Nov 24, 2016 at 10:49:33PM -0500, Mark Lord wrote:
> >> There is no possibility for them to be used for anything other than
> >> USB receive buffers, for this driver only. Nothing in the driver
> >> or kernel ever writes to those buffers after initial allocation,
> >> and only the driver and USB host controller ever have pointers to the buffers.
> >
> > You really are going to have to break out that USB monitor to verify
> > that this is the data coming across the wire.
>
> Not sure why, because there really is no other way for the data to
> appear where it does at the beginning of that URB buffer.

Broken USB host controller driver, or the device really is sending that
data to the host. It's either one or the other, and the only way you
can rule one of them out is to look at the data on the wire.

best of luck,

greg k-h

2016-11-25 14:36:56

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-11-25 09:22 AM, Greg KH wrote:
> On Fri, Nov 25, 2016 at 07:41:42AM -0500, Mark Lord wrote:
>> On 16-11-25 07:34 AM, Mark Lord wrote:
>>> On 16-11-25 04:53 AM, Greg KH wrote:
>>>> Note, there are "cheap" USB monitors that can be quite handy and that work on Linux:
>>>> http://www.totalphase.com/products/beagle-usb12/
>>>
>>> USD$455/each in quantity, vs. USD$8 for the USB ethernet dongle.
>>
>> Oh, wrong model. That one doesn't do USB2.
>> The USB2 version is a mere USD$1300 in quantity.
>>
>> Seems like rather a lot of money just to report a bug in a USB driver.
>> Perhaps the Linux Foundation might purchase one and loan it for this task?
>
> You already have access to a USB analyzer you said, why would I try to
> buy one and ship it around the world instead? Makes no sense...

No, the company where I am consulting has a paperweight called a "USB analyzer".
It doesn't work with Linux machines.

You are the one who suggested purchase of a working Linux compatible unit,
so I was just following up to see if you were serious about that.

No worries.
I'll see if the paperweight can be converted into something useful next week.

Cheers

2016-11-25 17:07:15

by David Miller

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

From: Mark Lord <[email protected]>
Date: Fri, 25 Nov 2016 07:49:35 -0500

> On 16-11-25 04:53 AM, Greg KH wrote:
>> On Thu, Nov 24, 2016 at 10:49:33PM -0500, Mark Lord wrote:
>>> There is no possibility for them to be used for anything other than
>>> USB receive buffers, for this driver only. Nothing in the driver
>>> or kernel ever writes to those buffers after initial allocation,
>>> and only the driver and USB host controller ever have pointers to the buffers.
>>
>> You really are going to have to break out that USB monitor to verify
>> that this is the data coming across the wire.
>
> Not sure why, because there really is no other way for the data to
> appear where it does at the beginning of that URB buffer.
>
> This does seem a rather unexpected burden to place upon someone
> reporting a regression in a USB network driver that corrupts user data.

If you are the only person who can actively reproduce this, which
seems to be the case right now, this is unfortunately the only way to
reach a proper analysis and fix.

2016-11-30 11:59:36

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord <[email protected]>
[...]
> > Not sure why, because there really is no other way for the data to
> > appear where it does at the beginning of that URB buffer.
> >
> > This does seem a rather unexpected burden to place upon someone
> > reporting a regression in a USB network driver that corrupts user data.
>
> If you are the only person who can actively reproduce this, which
> seems to be the case right now, this is unfortunately the only way to
> reach a proper analysis and fix.

I have tested it with iperf for more than five days without any error.
I will think about whether there is any other way to reproduce it.

Best Regards,
Hayes

2016-12-09 03:24:13

by Hayes Wang

Subject: RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

Mark Lord <[email protected]>

I found an issue with autosuspend, and it may result in the same
problem as yours. I am not sure whether this is helpful to you, because
it only occurs when autosuspend is enabled.

Best Regards,
Hayes


Attachments:
r8152.c (104.92 kB)

2016-12-09 13:05:45

by Mark Lord

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On 16-12-08 10:23 PM, Hayes Wang wrote:
> Mark Lord <[email protected]>
>
> I found an issue with autosuspend, and it may result in the same
> problem as yours. I am not sure whether this is helpful to you, because
> it only occurs when autosuspend is enabled.

Thanks. I am using ASIX adapters now.

I did try the latest 4.9-rc8, and 4.8.12 kernels with the r8152 dongle yesterday,
in hope that perhaps the many EHCI fixes from those kernels might help out.

The dongle was unusable with those newer kernels.
Most of the time it failed with "Get ether addr fail\n" at startup.

On the occasions where it got past that point, it often failed
the DHCP negotiation, but this looks more like a bug elsewhere in
the kernel, possibly racing against initialization of the random
number generators. Adding a 2-second sleep to the r8152 probe
function made this error mostly go away.

Cheers
--
Mark Lord

2017-01-01 00:07:44

by Ansis Atteka

Subject: Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

On Wed, Nov 30, 2016 at 3:58 AM, Hayes Wang <[email protected]> wrote:
> Mark Lord <[email protected]>
> [...]
>> > Not sure why, because there really is no other way for the data to
>> > appear where it does at the beginning of that URB buffer.
>> >
>> > This does seem a rather unexpected burden to place upon someone
>> > reporting a regression in a USB network driver that corrupts user data.
>>
>> If you are the only person who can actively reproduce this, which
>> seems to be the case right now, this is unfortunately the only way to
>> reach a proper analysis and fix.
>
> I have tested it with iperf for more than five days without any error.
> I will think about whether there is any other way to reproduce it.
>

For the past few days I have been debugging a similar data corruption
bug related to r8152 driver, but on x86-64 platform. Also, I think
that this data corruption bug has some serious security implications,
because it appears that "corrupted data" is actually 530 byte fragment
from one of the previous Ethernet frames that Realtek device just
received. See the ping test at the bottom of my email that
demonstrates this.

Besides the data corruption problem I am also experiencing another
serious problem that could be related and manifests itself in XHCI
module when Realtek Ethernet port receives packets at "high" rate (ie
10Mbps or higher). This second problem correlates with error messages
in kern.log printed by xhci-hcd. Ethernet connectivity is completely
lost at this time until I reload r8152 driver:

[ 2540.426240] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426258] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f010 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426259] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426260] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f020 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426334] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426336] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f030 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426372] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426373] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f040 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.426488] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.426491] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f050 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.437020] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.437024] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f060 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.438239] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.438246] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f070 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
[ 2540.438493] xhci_hcd 0000:0e:00.0: ERROR Transfer event TRB DMA ptr
not part of current TD ep_index 2 comp_code 13
[ 2540.438495] xhci_hcd 0000:0e:00.0: Looking for event-dma
00000000fff0f080 trb-start 00000000ff5c9fe0 trb-end 00000000ff5c9fe0
seg-start 00000000ff5c9000 seg-end 00000000ff5c9ff0
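
Each pair of messages above reports a transfer event whose DMA pointer lies outside the ring segment being searched. A simplified sketch of that range test, using the addresses from the log; the helper name is hypothetical, and the real xhci-hcd lookup also has to handle transfers that span more than one segment.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `event_dma` falls inside the segment bounded by `seg_start`
 * and `seg_end`, where `seg_end` is the address of the last TRB
 * (inclusive), as printed in the log. */
bool dma_in_segment(uint64_t event_dma, uint64_t seg_start, uint64_t seg_end)
{
	return event_dma >= seg_start && event_dma <= seg_end;
}
```

With the logged values, event-dma 0xfff0f010 is not between seg-start 0xff5c9000 and seg-end 0xff5c9ff0, while the expected TRB at 0xff5c9fe0 is.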


All of that is happening on my X86-64 Dell XPS15 9550 laptop that is
connected to Ethernet via Dell TB15 dock. This Dell TB 15 Dock uses
Realtek chip to provide Ethernet connectivity to laptop:

# lsusb
...
Bus 004 Device 003: ID 0bda:8153 Realtek Semiconductor Corp.
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 3.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 9
idVendor 0x0bda Realtek Semiconductor Corp.
idProduct 0x8153
bcdDevice 30.01
iManufacturer 1 Realtek
iProduct 2 USB 10/100/1000 LAN
iSerial 6 000001000000
bNumConfigurations 2

This Realtek Ethernet port is connected to a XHCI ASMedia host
controller that also resides on Dell TB15 Dock. The dock itself is
connected via Thunderbolt 3 cable to laptop:

# lspci
....
0e:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller


In my case it is easy to reproduce either of those two issues. Here
are my observations:
1. The Ethernet controller on Dell TB15 dock was working completely
fine while I had Windows 10 installed on my Laptop.
2. I have tried various Linux distributions - Ubuntu 16.10, Ubuntu
14.04, CentOS 7. All of them fail with "ERROR Transfer event TRB DMA
ptr not part of current TD ep_index 2 comp_code 13" error message
under high load.
3. I have tried Ubuntu 16.10 and Ubuntu 16.04. Both of them are
affected by this data corruption bug. I did not test for data
corruption on CentOS or other Linux distributions that come with older
Linux kernels than Ubuntu.
4. If I start two ping instances at the same time then it appears that
530 bytes from the first ping instance are occasionally "injected"
into ping payload of the second ping instance. Also, I was able to
reproduce this exact same issue with TCP.

sudo ping -i 0.05 -p ff -s 15000 10.33.75.80 # Sending 0xff as payload
....
15008 bytes from 10.33.75.80: icmp_seq=39 ttl=64 time=104 ms
wrong data byte #9822 should be 0xff but was 0x0
#16 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff
#9776 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#9808 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0
#9840 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9872 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9904 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#9968 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10032 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10096 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10192 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10224 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10288 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10320 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#10352 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#10384 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
...

sudo ping -i 0.05 -p 00 -s 15000 10.33.75.80 # Sending 0x00 as payload
...
15008 bytes from 10.33.75.80: icmp_seq=164 ttl=64 time=95.4 ms
wrong data byte #11302 should be 0x0 but was 0xff
#16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
#11248 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11280 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff ff ff ff ff ff ff ff ff ff
#11312 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11344 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11376 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11408 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11472 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11504 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11536 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11568 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11600 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11632 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11664 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11696 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11728 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11760 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11792 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff
#11824 ff ff ff ff ff ff ff ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11856 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#11888 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
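
The length of the injected fragment can be read off the dumps above by locating the first byte that differs from the fill pattern and counting how long the mismatch runs. A small sketch of that calculation; the struct and function names are made up for illustration.

```c
#include <stddef.h>

struct corrupt_run {
	size_t start;  /* offset of the first mismatching byte, or len if none */
	size_t length; /* number of consecutive mismatching bytes, 0 if none */
};

/* Finds the first run of bytes in `buf` that differ from `pattern`. */
struct corrupt_run find_corrupt_run(const unsigned char *buf, size_t len,
				    unsigned char pattern)
{
	struct corrupt_run run = { 0, 0 };
	size_t i = 0;

	/* Skip the leading bytes that match the expected pattern. */
	while (i < len && buf[i] == pattern)
		i++;
	run.start = i;

	/* Count the bytes that do not match. */
	while (i < len && buf[i] != pattern)
		i++;
	run.length = i - run.start;

	return run;
}
```

In the first dump the mismatch runs from byte 9822 up to, but not including, byte 10352; in the second it runs from 11302 up to 11832. Both runs are 530 bytes long.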

2019-01-05 14:23:28

by Mark Lord

Subject: r8152: data corruption in various scenarios

A couple of years back, I reported data corruption resulting from
a change in kernel 3.16 which enabled hardware checksums in the r8152 driver.
This was happening on an embedded system that was using a r8152 USB dongle.

At the time, it was very difficult to figure out what could possibly be causing it,
other than that re-enabling software checksums prevented corrupted packets from
resulting in more serious issues.
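
For context, the software check in question is the standard Internet checksum (RFC 1071): a one's-complement sum over 16-bit words. A minimal user-space sketch, not the kernel's optimized implementation; the function name is chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes the RFC 1071 Internet checksum over `len` bytes of `data`. */
uint16_t inet_checksum(const unsigned char *data, size_t len)
{
	uint64_t sum = 0;

	/* Add the data as big-endian 16-bit words. */
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* Pad a trailing odd byte with zero. */
	if (len == 1)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A receiver that recomputes this sum over the received bytes will usually notice when part of a packet has been replaced, which is why falling back to software checksums hides the symptom without explaining the cause.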

Since that time, more and more reports of similar corruption and issues
have been trickling in. Eg.

https://lore.kernel.org/patchwork/patch/873920/

Note that there are reports in the thread above that the issues
are not limited to only the built-in ethernet chip of the dock.

There is even now a special hack in the upstream r8152.c to attempt to detect
a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.

Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
in its net:

[5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation

So one issue is that the code is not correctly identifying the dock,
and the WD15 is claimed to be immune from the r8152 issues.
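
This kind of misidentification is what one would expect if the quirk keys on a USB serial string that more than one dock model reports. A hypothetical sketch of such a match follows. The function name is invented, and the serial value is the iSerial from the lsusb output earlier in this thread, used only as an example; it is an assumption, not a statement of what the upstream driver checks.

```c
#include <stdbool.h>
#include <string.h>

/* Example value only: the iSerial reported by the dock earlier in the
 * thread. Whether upstream matches on exactly this string is assumed. */
#define EXAMPLE_QUIRK_SERIAL "000001000000"

/* True if the device's serial string equals the quirk's serial.
 * Any dock reporting the same string matches, whatever its model. */
bool dock_quirk_matches(const char *serial)
{
	return serial != NULL && strcmp(serial, EXAMPLE_QUIRK_SERIAL) == 0;
}
```

A match like this is only as precise as the serial is unique, so a second dock model shipping the same string would trip the same workaround.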

One of the symptoms of the r8152 issue, reported by Ansis Atteka,
were messages like this:

xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
comp_code 1

I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
with the TB16 "workaround" enabled in Linux kernel 4.20.0.

From this I conclude that the workaround is not 100% complete yet.
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-05 14:24:30

by Mark Lord

Subject: Re: r8152: data corruption in various scenarios

On 2019-01-05 9:14 a.m., Mark Lord wrote:
> A couple of years back, I reported data corruption resulting from
> a change in kernel 3.16 which enabled hardware checksums in the r8152 driver.
> This was happening on an embedded system that was using a r8152 USB dongle.
>
> At the time, it was very difficult to figure out what could possibly be causing it,
> other than that re-enabling software checksums prevented corrupted packets from
> resulting in more serious issues.
>
> Since that time, more and more reports of similar corruption and issues
> have been trickling in. Eg.
>
> https://lore.kernel.org/patchwork/patch/873920/

Forgot to include this link (below) where people still have the issue
even with the driver workaround. Switching to software checksums "fixes" it:

https://bugzilla.redhat.com/show_bug.cgi?id=1460789

>
> Note that there are reports in the thread above that the issues
> are not limited to only the built-in ethernet chip of the dock.
>
> There is even now a special hack in the upstream r8152.c to attempt to detect
> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>
> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
> in its net:
>
> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>
> So one issue is that the code is not correctly identifying the dock,
> and the WD15 is claimed to be immune from the r8152 issues.
>
> One of the symptoms of the r8152 issue, reported by Ansis Atteka,
> were messages like this:
>
> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
> comp_code 1
>
> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>
> From this I conclude that the workaround is not 100% complete yet.
>


--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-06 19:18:05

by Kai-Heng Feng

Subject: Re: r8152: data corruption in various scenarios



> On Jan 5, 2019, at 10:14 PM, Mark Lord <[email protected]> wrote:
>
> A couple of years back, I reported data corruption resulting from
> a change in kernel 3.16 which enabled hardware checksums in the r8152 driver.
> This was happening on an embedded system that was using a r8152 USB dongle.
>
> At the time, it was very difficult to figure out what could possibly be causing it,
> other than that re-enabling software checksums prevented corrupted packets from
> resulting in more serious issues.
>
> Since that time, more and more reports of similar corruption and issues
> have been trickling in. Eg.
>
> https://lore.kernel.org/patchwork/patch/873920/
>
> Note that there are reports in the thread above that the issues
> are not limited to only the built-in ethernet chip of the dock.
>
> There is even now a special hack in the upstream r8152.c to attempt to detect
> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>
> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
> in its net:
>
> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation

The serial should be unique according to Dell.

>
> So one issue is that the code is not correctly identifying the dock,
> and the WD15 is claimed to be immune from the r8152 issues.

The WD15 I tested didn't use that serial number though...

>
> One of the symptoms of the r8152 issue, reported by Ansis Atteka,
> were messages like this:
>
> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
> comp_code 1

This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
("xhci: Bad Ethernet performance plugged in ASM1042A host").

>
> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
> with the TB16 "workaround" enabled in Linux kernel 4.20.0.

Is the xHC that the WD15 is connected to an ASMedia one?

Kai-Heng

>
> From this I conclude that the workaround is not 100% complete yet.
> --
> Mark Lord
> Real-Time Remedies Inc.
> [email protected]


2019-01-06 21:16:52

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-06 2:14 p.m., Kai Heng Feng wrote:
>> On Jan 5, 2019, at 10:14 PM, Mark Lord <[email protected]> wrote:
..
>> There is even now a special hack in the upstream r8152.c to attempt to detect
>> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>>
>> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
>> in its net:
>>
>> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>
> The serial should be unique according to Dell.
>
>> So one issue is that the code is not correctly identifying the dock,
>> and the WD15 is claimed to be immune from the r8152 issues.
>
> The WD15 I tested didn't use that serial number though...

What info do you need from me about the WD15 so this can be corrected?

>> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
>> comp_code 1
>
> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>
>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>
> Is the xHC WD15 connected an ASMedia one?

I don't know. I *think* it identifies as a DSL6340 (see below).

Here is lspci and lsusb:

$ lspci -vt
-[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
+-02.0 Intel Corporation UHD Graphics 620
+-04.0 Intel Corporation Skylake Processor Thermal Subsystem
+-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
+-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
+-15.0 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
+-15.1 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
+-16.0 Intel Corporation Sunrise Point-LP CSME HECI #1
+-1c.0-[01-39]----00.0-[02-39]--+-00.0-[03]--
| +-01.0-[04-38]--
| \-02.0-[39]----00.0 Intel Corporation DSL6340 USB 3.1 Controller [Alpine Ridge]
+-1c.4-[3a]----00.0 Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
+-1d.0-[3b]----00.0 Samsung Electronics Co Ltd Device a808
+-1f.0 Intel Corporation Device 9d4e
+-1f.2 Intel Corporation Sunrise Point-LP PMC
+-1f.3 Intel Corporation Sunrise Point-LP HD Audio
\-1f.4 Intel Corporation Sunrise Point-LP SMBus
$ lsusb -t
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/7p, 5000M
|__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=r8152, 5000M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/7p, 480M
|__ Port 5: Dev 3, If 1, Class=Audio, Driver=snd-usb-audio, 480M
|__ Port 5: Dev 3, If 2, Class=Audio, Driver=snd-usb-audio, 480M
|__ Port 5: Dev 3, If 0, Class=Audio, Driver=snd-usb-audio, 480M
|__ Port 5: Dev 3, If 3, Class=Audio, Driver=snd-usb-audio, 480M
|__ Port 6: Dev 4, If 0, Class=Human Interface Device, Driver=usbhid, 12M
|__ Port 6: Dev 4, If 1, Class=Human Interface Device, Driver=usbhid, 12M
|__ Port 6: Dev 4, If 2, Class=Human Interface Device, Driver=usbhid, 12M
|__ Port 7: Dev 5, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M
|__ Port 3: Dev 2, If 0, Class=Wireless, Driver=btusb, 12M
|__ Port 3: Dev 2, If 1, Class=Wireless, Driver=btusb, 12M

Thanks for having a look.
--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-06 21:18:16

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-06 4:13 p.m., Mark Lord wrote:
> On 2019-01-06 2:14 p.m., Kai Heng Feng wrote:>> On Jan 5, 2019, at 10:14 PM, Mark Lord
> <[email protected]> wrote:
> ..
>>> There is even now a special hack in the upstream r8152.c to attempt to detect
>>> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>>>
>>> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
>>> in its net:
>>>
>>> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>>
>> The serial should be unique according to Dell.
>>
>>> So one issue is that the code is not correctly identifying the dock,
>>> and the WD15 is claimed to be immune from the r8152 issues.
>>
>> The WD15 I tested didn't use that serial number though...
>
> What info do you need from me about the WD15 so this can be corrected?
>
>>> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
>>> comp_code 1
>>
>> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
>> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>>
>>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>>
>> Is the xHC WD15 connected an ASMedia one?
>
> I don't know. I *think* it identifies as a DSL6340 (see below).
>
> Here is lspci and lsusb:
>
> $ lspci -vt
> -[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
> +-02.0 Intel Corporation UHD Graphics 620
> +-04.0 Intel Corporation Skylake Processor Thermal Subsystem
> +-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
> +-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
> +-15.0 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
> +-15.1 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
> +-16.0 Intel Corporation Sunrise Point-LP CSME HECI #1
> +-1c.0-[01-39]----00.0-[02-39]--+-00.0-[03]--
> | +-01.0-[04-38]--
> | \-02.0-[39]----00.0 Intel Corporation DSL6340 USB 3.1
> Controller [Alpine Ridge]
> +-1c.4-[3a]----00.0 Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
> +-1d.0-[3b]----00.0 Samsung Electronics Co Ltd Device a808
> +-1f.0 Intel Corporation Device 9d4e
> +-1f.2 Intel Corporation Sunrise Point-LP PMC
> +-1f.3 Intel Corporation Sunrise Point-LP HD Audio
> \-1f.4 Intel Corporation Sunrise Point-LP SMBus


Mmm.. lspci -vt isn't as verbose as I thought, so here is plain lspci to fill in the blanks:

$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
01:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
02:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
02:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
02:02.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
39:00.0 USB controller: Intel Corporation DSL6340 USB 3.1 Controller [Alpine Ridge]
3a:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
3b:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a808


--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-07 03:56:53

by Hayes Wang

[permalink] [raw]
Subject: RE: r8152: data corruption in various scenarios

Monday, January 07, 2019 5:17 AM
[...]
>> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
>> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>>
>>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>>
>> Is the xHC WD15 connected an ASMedia one?
>
> I don't know. I *think* it identifies as a DSL6340 (see below).
>

According to our records, it is related to the ASMedia controller.

Best Regards,
Hayes


2019-01-07 04:12:05

by Kai-Heng Feng

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios



> On Jan 7, 2019, at 05:16, Mark Lord <[email protected]> wrote:
>
> On 2019-01-06 4:13 p.m., Mark Lord wrote:
>> On 2019-01-06 2:14 p.m., Kai Heng Feng wrote:>> On Jan 5, 2019, at 10:14 PM, Mark Lord
>> <[email protected]> wrote:
>> ..
>>>> There is even now a special hack in the upstream r8152.c to attempt to detect
>>>> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>>>>
>>>> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
>>>> in its net:
>>>>
>>>> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>>>
>>> The serial should be unique according to Dell.
>>>
>>>> So one issue is that the code is not correctly identifying the dock,
>>>> and the WD15 is claimed to be immune from the r8152 issues.
>>>
>>> The WD15 I tested didn't use that serial number though...
>>
>> What info do you need from me about the WD15 so this can be corrected?
>>
>>>> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
>>>> comp_code 1
>>>
>>> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
>>> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>>>
>>>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>>>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>>>
>>> Is the xHC WD15 connected an ASMedia one?
>>
>> I don't know. I *think* it identifies as a DSL6340 (see below).
>>
>> Here is lspci and lsusb:
>>
>> $ lspci -vt
>> -[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
>> +-02.0 Intel Corporation UHD Graphics 620
>> +-04.0 Intel Corporation Skylake Processor Thermal Subsystem
>> +-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
>> +-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
>> +-15.0 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
>> +-15.1 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
>> +-16.0 Intel Corporation Sunrise Point-LP CSME HECI #1
>> +-1c.0-[01-39]----00.0-[02-39]--+-00.0-[03]--
>> | +-01.0-[04-38]--
>> | \-02.0-[39]----00.0 Intel Corporation DSL6340 USB 3.1
>> Controller [Alpine Ridge]
>> +-1c.4-[3a]----00.0 Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
>> +-1d.0-[3b]----00.0 Samsung Electronics Co Ltd Device a808
>> +-1f.0 Intel Corporation Device 9d4e
>> +-1f.2 Intel Corporation Sunrise Point-LP PMC
>> +-1f.3 Intel Corporation Sunrise Point-LP HD Audio
>> \-1f.4 Intel Corporation Sunrise Point-LP SMBus
>
>
> Mmm.. lspci -vt isn't as verbose as I thought, so here is plain lspci to fill in the blanks:
>
> $ lspci
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM
> Registers (rev 08)
> 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
>
> 00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
>
> 00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
>
> 00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
>
> 00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
> (rev 21)
> 00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
> (rev 21)
> 00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
>
> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port (rev f1)
>
> 00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
>
> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
>
> 00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
>
> 00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
>
> 00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
>
> 00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
>
> 01:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>
> 02:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>
> 02:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>
> 02:02.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>
> 39:00.0 USB controller: Intel Corporation DSL6340 USB 3.1 Controller [Alpine Ridge]

So it’s not an ASMedia one.

Before digging further, please make sure the system firmware (BIOS), Thunderbolt controller NVM and WD15 firmware are all up-to-date.

Kai-Heng

>
> 3a:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
>
> 3b:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a808
>
>
> --
> Mark Lord
> Real-Time Remedies Inc.
> [email protected]


2019-01-07 04:15:31

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-06 11:09 p.m., Kai Heng Feng wrote:
>
>
>> On Jan 7, 2019, at 05:16, Mark Lord <[email protected]> wrote:
>>
>> On 2019-01-06 4:13 p.m., Mark Lord wrote:
>>> On 2019-01-06 2:14 p.m., Kai Heng Feng wrote:>> On Jan 5, 2019, at 10:14 PM, Mark Lord
>>> <[email protected]> wrote:
>>> ..
>>>>> There is even now a special hack in the upstream r8152.c to attempt to detect
>>>>> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>>>>>
>>>>> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
>>>>> in its net:
>>>>>
>>>>> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>>>>
>>>> The serial should be unique according to Dell.
>>>>
>>>>> So one issue is that the code is not correctly identifying the dock,
>>>>> and the WD15 is claimed to be immune from the r8152 issues.
>>>>
>>>> The WD15 I tested didn't use that serial number though...
>>>
>>> What info do you need from me about the WD15 so this can be corrected?
>>>
>>>>> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
>>>>> comp_code 1
>>>>
>>>> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
>>>> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>>>>
>>>>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>>>>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>>>>
>>>> Is the xHC WD15 connected an ASMedia one?
>>>
>>> I don't know. I *think* it identifies as a DSL6340 (see below).
>>>
>>> Here is lspci and lsusb:
>>>
>>> $ lspci -vt
>>> -[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
>>> +-02.0 Intel Corporation UHD Graphics 620
>>> +-04.0 Intel Corporation Skylake Processor Thermal Subsystem
>>> +-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
>>> +-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
>>> +-15.0 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
>>> +-15.1 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
>>> +-16.0 Intel Corporation Sunrise Point-LP CSME HECI #1
>>> +-1c.0-[01-39]----00.0-[02-39]--+-00.0-[03]--
>>> | +-01.0-[04-38]--
>>> | \-02.0-[39]----00.0 Intel Corporation DSL6340 USB 3.1
>>> Controller [Alpine Ridge]
>>> +-1c.4-[3a]----00.0 Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
>>> +-1d.0-[3b]----00.0 Samsung Electronics Co Ltd Device a808
>>> +-1f.0 Intel Corporation Device 9d4e
>>> +-1f.2 Intel Corporation Sunrise Point-LP PMC
>>> +-1f.3 Intel Corporation Sunrise Point-LP HD Audio
>>> \-1f.4 Intel Corporation Sunrise Point-LP SMBus
>>
>>
>> Mmm.. lspci -vt isn't as verbose as I thought, so here is plain lspci to fill in the blanks:
>>
>> $ lspci
>> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM
>> Registers (rev 08)
>> 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
>>
>> 00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
>>
>> 00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
>>
>> 00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
>>
>> 00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
>> (rev 21)
>> 00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
>> (rev 21)
>> 00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
>>
>> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port (rev f1)
>>
>> 00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
>>
>> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
>>
>> 00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
>>
>> 00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
>>
>> 00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
>>
>> 00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
>>
>> 01:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>
>> 02:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>
>> 02:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>
>> 02:02.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>
>> 39:00.0 USB controller: Intel Corporation DSL6340 USB 3.1 Controller [Alpine Ridge]
>
> So it’s not an ASMedia one.
>
> Before digging further, please make sure the system firmware (BIOS), Thunderbolt controller NVM and WD15 firmware are all up-to-date.

Everything is completely up to date.


--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-07 06:49:09

by Kai-Heng Feng

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios



> On Jan 7, 2019, at 12:13, Mark Lord <[email protected]> wrote:
>
> On 2019-01-06 11:09 p.m., Kai Heng Feng wrote:
>>
>>
>>> On Jan 7, 2019, at 05:16, Mark Lord <[email protected]> wrote:
>>>
>>> On 2019-01-06 4:13 p.m., Mark Lord wrote:
>>>> On 2019-01-06 2:14 p.m., Kai Heng Feng wrote:>> On Jan 5, 2019, at 10:14 PM, Mark Lord
>>>> <[email protected]> wrote:
>>>> ..
>>>>>> There is even now a special hack in the upstream r8152.c to attempt to detect
>>>>>> a Dell TB16 dock and disable RX Aggregation in the driver to prevent such issues.
>>>>>>
>>>>>> Well.. I have a WD15 dock, not a TB16, and that same hack also catches my dock
>>>>>> in its net:
>>>>>>
>>>>>> [5.794641] usb 4-1.2: Dell TB16 Dock, disable RX aggregation
>>>>>
>>>>> The serial should be unique according to Dell.
>>>>>
>>>>>> So one issue is that the code is not correctly identifying the dock,
>>>>>> and the WD15 is claimed to be immune from the r8152 issues.
>>>>>
>>>>> The WD15 I tested didn't use that serial number though...
>>>>
>>>> What info do you need from me about the WD15 so this can be corrected?
>>>>
>>>>>> xhci_hcd 0000:39:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 13
>>>>>> comp_code 1
>>>>>
>>>>> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
>>>>> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
>>>>>
>>>>>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
>>>>>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
>>>>>
>>>>> Is the xHC WD15 connected an ASMedia one?
>>>>
>>>> I don't know. I *think* it identifies as a DSL6340 (see below).
>>>>
>>>> Here is lspci and lsusb:
>>>>
>>>> $ lspci -vt
>>>> -[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
>>>> +-02.0 Intel Corporation UHD Graphics 620
>>>> +-04.0 Intel Corporation Skylake Processor Thermal Subsystem
>>>> +-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
>>>> +-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
>>>> +-15.0 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
>>>> +-15.1 Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
>>>> +-16.0 Intel Corporation Sunrise Point-LP CSME HECI #1
>>>> +-1c.0-[01-39]----00.0-[02-39]--+-00.0-[03]--
>>>> | +-01.0-[04-38]--
>>>> | \-02.0-[39]----00.0 Intel Corporation DSL6340 USB 3.1
>>>> Controller [Alpine Ridge]
>>>> +-1c.4-[3a]----00.0 Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
>>>> +-1d.0-[3b]----00.0 Samsung Electronics Co Ltd Device a808
>>>> +-1f.0 Intel Corporation Device 9d4e
>>>> +-1f.2 Intel Corporation Sunrise Point-LP PMC
>>>> +-1f.3 Intel Corporation Sunrise Point-LP HD Audio
>>>> \-1f.4 Intel Corporation Sunrise Point-LP SMBus
>>>
>>>
>>> Mmm.. lspci -vt isn't as verbose as I thought, so here is plain lspci to fill in the blanks:
>>>
>>> $ lspci
>>> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM
>>> Registers (rev 08)
>>> 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
>>>
>>> 00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
>>>
>>> 00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
>>>
>>> 00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
>>>
>>> 00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
>>> (rev 21)
>>> 00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
>>> (rev 21)
>>> 00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
>>>
>>> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port (rev f1)
>>>
>>> 00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
>>>
>>> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
>>>
>>> 00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
>>>
>>> 00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
>>>
>>> 00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
>>>
>>> 00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
>>>
>>> 01:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>>
>>> 02:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>>
>>> 02:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>>
>>> 02:02.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
>>>
>>> 39:00.0 USB controller: Intel Corporation DSL6340 USB 3.1 Controller [Alpine Ridge]
>>
>> So it’s not an ASMedia one.
>>
>> Before digging further, please make sure the system firmware (BIOS), Thunderbolt controller NVM and WD15 firmware are all up-to-date.
>
> Everything is completely up to date.

Do you happen to use a Dell system? We can do some testing here.

Kai-Heng

>
>
> --
> Mark Lord
> Real-Time Remedies Inc.
> [email protected]


2019-01-07 07:08:21

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-07 1:46 a.m., Kai Heng Feng wrote:
>
> Do you happen to use a Dell system? We can do some test here.

Yes. It is a Dell XPS 13 9360 i7-8550U notebook,
with the Dell WD15 USB-C dock.

--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-07 17:36:42

by Mario Limonciello

[permalink] [raw]
Subject: RE: r8152: data corruption in various scenarios

> -----Original Message-----
> From: Hayes Wang <[email protected]>
> Sent: Sunday, January 6, 2019 9:54 PM
> To: Mark Lord; Kai Heng Feng
> Cc: Ansis Atteka; David Miller; [email protected]; [email protected];
> [email protected]; nic_swsd; [email protected]; linux-
> [email protected]; Limonciello, Mario; Ryankao
> Subject: RE: r8152: data corruption in various scenarios
>
>
> [EXTERNAL EMAIL]
>
> Monday, January 07, 2019 5:17 AM
> [...]
> >> This is probably an xHC bug. A similar issue is fixed by commit 9da5a1092b13
> >> ("xhci: Bad Ethernet performance plugged in ASM1042A host”).
> >>
> >>> I just got that exact message above, with the r8152 in my 1-day old WD15 dock,
> >>> with the TB16 "workaround" enabled in Linux kernel 4.20.0.
> >>
> >> Is the xHC WD15 connected an ASMedia one?
> >
> > I don't know. I *think* it identifies as a DSL6340 (see below).
> >
>
> According to our record, it is relative to the asmedia.
>

DSL6340 should be referring to the Alpine Ridge controller in the system.

The TB16 contains an ASMedia host controller. It's a Thunderbolt dock, and all USB devices
are connected to the ASMedia host controller in the dock.

The WD15 does not contain an ASMedia host controller; it is connected to the system's
USB host controller.

2019-01-07 18:37:30

by Mario Limonciello

[permalink] [raw]
Subject: RE: r8152: data corruption in various scenarios

> -----Original Message-----
> From: Mark Lord <[email protected]>
> Sent: Monday, January 7, 2019 12:06 PM
> To: Limonciello, Mario; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Subject: Re: r8152: data corruption in various scenarios
>
>
> [EXTERNAL EMAIL]
>
> On 2019-01-07 11:01 a.m., [email protected] wrote:
> >
> > TB16 contains ASMedia host controller. It's a Thunderbolt dock and all USB
> devices
> > are connected to ASMedia host controller in the dock.
> >
> > WD15 does not contain an ASMedia host controller, it connected to system's
> > USB host controller.
>
>
> Thank-you, Mario.
>
> So.. why are we enabling the r8153 (USB-ethernet) workaround on this WD15
> dock?
> The discussion back in 2017 was that only the TB15/TB16 were affected by
> the XHCI overruns it produces?
>
> --

The xHCI overrun workaround should only be applied on TB15/TB16, correct.

Can you double check the verbose information from lsusb for the r8153 device
on your WD15?

I just double-checked with the WD15 I have on hand, attached to an XPS 9380, and it's not
activating the quirk (bcdDevice was different).
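
For illustration, the kind of descriptor match being discussed can be sketched in user space. This is not the code from r8152.c: the function name looks_like_tb16() is made up, and the constants (bcdDevice 0x3011, serial strings "000001000000" and "000002000000") are an assumption about what the quirk compares against and should be checked against the driver source. lsusb prints bcdDevice in BCD notation, so the "30.11" in the lsusb output posted in this thread corresponds to the raw value 0x3011.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed quirk constant; verify against r8152.c before relying on it. */
#define ASSUMED_TB16_BCD_DEVICE 0x3011

/*
 * Hypothetical helper: returns true when the descriptor values would
 * match the assumed "Dell TB16" signature. serial may be NULL when the
 * device exposes no serial string.
 */
bool looks_like_tb16(uint16_t bcd_device, const char *serial)
{
	if (bcd_device != ASSUMED_TB16_BCD_DEVICE || serial == NULL)
		return false;

	return strcmp(serial, "000001000000") == 0 ||
	       strcmp(serial, "000002000000") == 0;
}

/* Exercise the helper with values taken from, or modelled on, this thread. */
void looks_like_tb16_examples(void)
{
	/* bcdDevice 30.11 and iSerial 000002000000, as in the posted lsusb output. */
	assert(looks_like_tb16(0x3011, "000002000000"));
	/* A different bcdDevice (value here is arbitrary) avoids the match. */
	assert(!looks_like_tb16(0x3100, "000002000000"));
	/* No serial string at all never matches. */
	assert(!looks_like_tb16(0x3011, NULL));
}
```

Under these assumptions, a WD15 reporting bcdDevice 30.11 and serial 000002000000 would be indistinguishable from a TB16 by descriptor values alone.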

If it's the same information as the TB16 (which it sounds like it is) Kai Heng and I will check
around internally to find out why they're looking the same.

I can hypothesize a few guesses of what happened.
My first guess would be a comparison issue with the logic in 176eb614b.

Looking at that commit, I guess I would ask on the compiler behavior of !strcmp().
Would that be matching the less than case as well as the zero case?
If so, it might need to be changed to strcmp() == 0.
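
On the !strcmp() question: in C, the logical-not operator yields 1 only when its operand compares equal to 0, so !strcmp(a, b) and strcmp(a, b) == 0 should behave identically, and neither is true for the less-than case. A small user-space check of that, with helper names invented for the comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helpers, named only for this comparison. */
bool equal_with_not(const char *a, const char *b)
{
	return !strcmp(a, b);		/* true only when strcmp() returns 0 */
}

bool equal_with_eq0(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

void compare_forms(void)
{
	/* "000001000000" sorts before "000002000000", so strcmp() is negative. */
	assert(strcmp("000001000000", "000002000000") < 0);

	/* Both forms reject the less-than case... */
	assert(!equal_with_not("000001000000", "000002000000"));
	assert(!equal_with_eq0("000001000000", "000002000000"));

	/* ...and the greater-than case... */
	assert(!equal_with_not("000002000000", "000001000000"));
	assert(!equal_with_eq0("000002000000", "000001000000"));

	/* ...and accept only identical strings. */
	assert(equal_with_not("000002000000", "000002000000"));
	assert(equal_with_eq0("000002000000", "000002000000"));
}
```

If that holds, switching the driver to strcmp() == 0 would be a style change rather than a behavior change.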

My second guess would be a newer Ethernet NVM image in manufacturing.
My third guess would be a manufacturing issue that put the wrong NVM image on your WD15.

2019-01-07 18:57:59

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-07 11:01 a.m., [email protected] wrote:
>
> TB16 contains ASMedia host controller. It's a Thunderbolt dock and all USB devices
> are connected to ASMedia host controller in the dock.
>
> WD15 does not contain an ASMedia host controller, it connected to system's
> USB host controller.


Thank-you, Mario.

So.. why are we enabling the r8153 (USB-ethernet) workaround on this WD15 dock?
The discussion back in 2017 was that only the TB15/TB16 were affected by
the XHCI overruns it produces?

--
Mark Lord
Real-Time Remedies Inc.
[email protected]

2019-01-07 19:59:09

by Mark Lord

[permalink] [raw]
Subject: Re: r8152: data corruption in various scenarios

On 2019-01-07 1:27 p.m., [email protected] wrote:
..
> The xHCI overrun workaround should only be applied on TB16/TB16, correct.
>
> Can you double check the verbose information from lsusb for the r8153 device
> on your WD15?

Sure, see below for the full output.

> If it's the same information as the TB16 (which it sounds like it is) Kai Heng and I will check
> around internally to find out why they're looking the same.

Thanks.

> My second guess would be maybe newer ethernet NVM in manufacturing.
> My third guess would be a manufacturing issue putting wrong NVM image on your WD15.

It could be one of those two things.
Let us know what you discover.

Thanks

Bus 004 Device 003: ID 0bda:8153 Realtek Semiconductor Corp.
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 3.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 9
idVendor 0x0bda Realtek Semiconductor Corp.
idProduct 0x8153
bcdDevice 30.11
iManufacturer 1 Realtek
iProduct 2 USB 10/100/1000 LAN
iSerial 6 000002000000
bNumConfigurations 2
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 57
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xa0
(Bus Powered)
Remote Wakeup
MaxPower 64mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 3
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 255 Vendor Specific Subclass
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0400 1x 1024 bytes
bInterval 0
bMaxBurst 3
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0400 1x 1024 bytes
bInterval 0
bMaxBurst 3
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 8
bMaxBurst 0
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 98
bNumInterfaces 2
bConfigurationValue 2
iConfiguration 0
bmAttributes 0xa0
(Bus Powered)
Remote Wakeup
MaxPower 64mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 2 Communications
bInterfaceSubClass 6 Ethernet Networking
bInterfaceProtocol 0
iInterface 5 CDC Communications Control
CDC Header:
bcdCDC 1.10
CDC Union:
bMasterInterface 0
bSlaveInterface 1
CDC Ethernet:
iMacAddress 3 54BF6450FC4F
bmEthernetStatistics 0x00000000
wMaxSegmentSize 1514
wNumberMCFilters 0x0000
bNumberPowerFilters 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0010 1x 16 bytes
bInterval 8
bMaxBurst 0
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 1
bAlternateSetting 0
bNumEndpoints 0
bInterfaceClass 10 CDC Data
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 1
bAlternateSetting 1
bNumEndpoints 2
bInterfaceClass 10 CDC Data
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 4 Ethernet Data
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0400 1x 1024 bytes
bInterval 0
bMaxBurst 3
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0400 1x 1024 bytes
bInterval 0
bMaxBurst 3
Binary Object Store Descriptor:
bLength 5
bDescriptorType 15
wTotalLength 22
bNumDeviceCaps 2
USB 2.0 Extension Device Capability:
bLength 7
bDescriptorType 16
bDevCapabilityType 2
bmAttributes 0x00000002
Link Power Management (LPM) Supported
SuperSpeed USB Device Capability:
bLength 10
bDescriptorType 16
bDevCapabilityType 3
bmAttributes 0x00
wSpeedsSupported 0x000e
Device can operate at Full Speed (12Mbps)
Device can operate at High Speed (480Mbps)
Device can operate at SuperSpeed (5Gbps)
bFunctionalitySupport 2
Lowest fully-functional device speed is High Speed (480Mbps)
bU1DevExitLat 10 micro seconds
bU2DevExitLat 2047 micro seconds
Device Status: 0x000c
(Bus Powered)
U1 Enabled
U2 Enabled
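
For anyone comparing this dump against one taken from a TB16, here is a minimal sketch of that comparison. It is not the driver's code. It assumes the two docks would be judged "the same" by idVendor, idProduct, bcdDevice and the iSerial string; the struct and function names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the identity fields that lsusb -v prints. */
struct dock_nic_id {
	uint16_t id_vendor;
	uint16_t id_product;
	uint16_t bcd_device;	/* raw BCD value, e.g. 0x3011 for "30.11" */
	const char *serial;	/* iSerial string, NULL if the device has none */
};

/*
 * Convert lsusb's "MM.mm" rendering of bcdDevice back to the raw 16-bit
 * BCD value. Reading each half as hex yields the BCD nibbles directly.
 * Returns false if the text is malformed or has trailing characters.
 */
static bool parse_bcd_device(const char *text, uint16_t *out)
{
	unsigned int major, minor;
	char trailing;

	if (sscanf(text, "%2x.%2x%c", &major, &minor, &trailing) != 2)
		return false;

	*out = (uint16_t)((major << 8) | minor);
	return true;
}

/* True when every identity field matches, including the serial string. */
static bool same_identity(const struct dock_nic_id *a,
			  const struct dock_nic_id *b)
{
	if (a->id_vendor != b->id_vendor ||
	    a->id_product != b->id_product ||
	    a->bcd_device != b->bcd_device)
		return false;

	/* Two devices without a serial string compare equal here. */
	if (!a->serial || !b->serial)
		return a->serial == b->serial;

	return strcmp(a->serial, b->serial) == 0;
}
```

With the values from the dump above, parse_bcd_device("30.11", &v) stores 0x3011 in v, and the serial to compare is "000002000000". If same_identity() reports a match against a TB16, the descriptors alone cannot tell the two docks apart.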

--
Mark Lord
Real-Time Remedies Inc.
[email protected]