2023-10-12 19:31:00

by Doug Anderson

Subject: [PATCH v3 0/5] r8152: Avoid writing garbage to the adapter's registers

This series is the result of a cooperative debug effort between
Realtek and the ChromeOS team. On ChromeOS, we've noticed that Realtek
Ethernet adapters can sometimes get so wedged that even a reboot of
the host can't get them to enumerate again, assuming that the adapter
was on a powered hub and didn't lose power when the host rebooted. This
is sometimes seen in the ChromeOS automated testing lab. The only way
to recover adapters in this state is to manually power cycle them.

I managed to reproduce one instance of this wedging (unknown if this
is truly related to what the test lab sees) by doing this:
1. Start a flood ping from a host to the device.
2. Drop the device into kdb.
3. Wait 90 seconds.
4. Resume from kdb (the "g" command).
5. Wait another 45 seconds.

Upon analysis, Realtek realized this was happening:

1. The Linux driver was getting a "Tx timeout" after resuming from kdb
and then trying to reset itself.
2. As part of the reset, the Linux driver was attempting to do a
read-modify-write of the adapter's registers.
3. The read would fail (due to a timeout) and the driver pretended
that the register contained all 0xFFs. See commit f53a7ad18959
("r8152: Set memory to all 0xFFs on failed reg reads")
4. The driver would take this value of all 0xFFs, modify it, and
attempt to write it back to the adapter.
5. By this time the USB channel seemed to recover and thus we'd
successfully write a value that was mostly 0xFFs to the adapter.
6. The adapter didn't like this and would wedge itself.

Another engineer also managed to reproduce wedging of the Realtek
Ethernet adapter during a reboot test on an AMD Chromebook. In that
case he was sometimes seeing -EPIPE returned from the control
transfers.

This patch series fixes both issues.

Changes in v3:
- Fixed v2 changelog ending up in the commit message.
- farmework -> framework in comments.

Changes in v2:
- ("Check for unplug in rtl_phy_patch_request()") new for v2.
- ("Check for unplug in r8153b_ups_en() / r8153c_ups_en()") new for v2.
- ("Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE") new for v2.
- Reset patch no longer based on retry patch, since that was dropped.
- Reset patch should be robust even if failures happen in probe.
- Switched booleans to bits in the "flags" variable.
- Check for -ENODEV instead of "udev->state == USB_STATE_NOTATTACHED"

Douglas Anderson (5):
r8152: Increase USB control msg timeout to 5000ms as per spec
r8152: Check for unplug in rtl_phy_patch_request()
r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Block future register access if register access fails

drivers/net/usb/r8152.c | 268 +++++++++++++++++++++++++++++++---------
1 file changed, 209 insertions(+), 59 deletions(-)

--
2.42.0.655.g421f12c284-goog


2023-10-12 19:31:21

by Doug Anderson

Subject: [PATCH v3 3/5] r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()

If the adapter is unplugged while we're looping in r8153b_ups_en() /
r8153c_ups_en() we could end up looping for 10 seconds (20 ms * 500
loops). Add code similar to what's done in other places in the driver
to check for unplug and bail.
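
To make the arithmetic above concrete, here is a rough userspace sketch of
the polling loop. It is not the driver code: struct poll_model and both
helpers are hypothetical names made up for illustration, and the sleep is
only counted, not actually performed. It assumes that a read from an
unplugged device never reports "ready".

```c
#include <assert.h>
#include <stdbool.h>

#define POLL_LOOPS 500		/* loop count from the commit message */
#define POLL_DELAY_MS 20	/* per-iteration sleep from the commit message */

/* Hypothetical device model: becomes ready after ready_after polls and
 * counts as unplugged once unplug_at polls have happened (negative: never).
 */
struct poll_model {
	int polls;
	int ready_after;
	int unplug_at;
};

static bool model_unplugged(const struct poll_model *m)
{
	return m->unplug_at >= 0 && m->polls >= m->unplug_at;
}

/* Returns the simulated milliseconds spent waiting. With check_unplug set
 * to false the loop runs to completion even though the device is gone.
 */
static int poll_ready(struct poll_model *m, bool check_unplug)
{
	int waited_ms = 0;
	int i;

	assert(m);

	for (i = 0; i < POLL_LOOPS; i++) {
		if (check_unplug && model_unplugged(m))
			return waited_ms;	/* bail out early */
		/* Assumption: an unplugged device never reports ready. */
		if (!model_unplugged(m) && m->polls >= m->ready_after)
			break;
		m->polls++;
		waited_ms += POLL_DELAY_MS;	/* stands in for msleep(20) */
	}
	return waited_ms;
}
```

With the device gone from the start, the unchecked loop burns the full
500 * 20 ms = 10000 ms, while the checked loop returns without waiting.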

Signed-off-by: Douglas Anderson <[email protected]>
---

(no changes since v2)

Changes in v2:
- ("Check for unplug in r8153b_ups_en() / r8153c_ups_en()") new for v2.

drivers/net/usb/r8152.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index fff2f9e67b5f..888d3884821e 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3663,6 +3663,8 @@ static void r8153b_ups_en(struct r8152 *tp, bool enable)
int i;

for (i = 0; i < 500; i++) {
+ if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ return;
if (ocp_read_word(tp, MCU_TYPE_PLA, PLA_BOOT_CTRL) &
AUTOLOAD_DONE)
break;
@@ -3703,6 +3705,8 @@ static void r8153c_ups_en(struct r8152 *tp, bool enable)
int i;

for (i = 0; i < 500; i++) {
+ if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ return;
if (ocp_read_word(tp, MCU_TYPE_PLA, PLA_BOOT_CTRL) &
AUTOLOAD_DONE)
break;
--
2.42.0.655.g421f12c284-goog

2023-10-12 19:31:27

by Doug Anderson

Subject: [PATCH v3 1/5] r8152: Increase USB control msg timeout to 5000ms as per spec

According to the comment next to USB_CTRL_GET_TIMEOUT and
USB_CTRL_SET_TIMEOUT, although sending/receiving control messages is
usually quite fast, the spec allows them to take up to 5 seconds.
Let's increase the timeout in the Realtek driver from 500ms to 5000ms
(using the #defines) to account for this.

This is not just a theoretical change. The need for the longer timeout
was seen in testing. Specifically, if you drop a sc7180-trogdor based
Chromebook into the kdb debugger and then "go" again after sitting in
the debugger for a while, the next USB control message takes a long
time. Out of ~40 tests the slowest USB control message was 4.5
seconds.

While dropping into kdb is not exactly an end-user scenario, the above
is similar to what could happen due to a temporary interrupt storm,
what could happen if there was a host controller (HW or SW) issue, or
what could happen if the Realtek device got into a confused state and
needed time to recover.

This change is fairly critical since the r8152 driver in Linux doesn't
expect register reads/writes (which are backed by USB control
messages) to fail.

Fixes: ac718b69301c ("net/usb: new driver for RTL8152")
Suggested-by: Hayes Wang <[email protected]>
Signed-off-by: Douglas Anderson <[email protected]>
---

(no changes since v1)

drivers/net/usb/r8152.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 0c13d9950cd8..482957beae66 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1212,7 +1212,7 @@ int get_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)

ret = usb_control_msg(tp->udev, tp->pipe_ctrl_in,
RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
- value, index, tmp, size, 500);
+ value, index, tmp, size, USB_CTRL_GET_TIMEOUT);
if (ret < 0)
memset(data, 0xff, size);
else
@@ -1235,7 +1235,7 @@ int set_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)

ret = usb_control_msg(tp->udev, tp->pipe_ctrl_out,
RTL8152_REQ_SET_REGS, RTL8152_REQT_WRITE,
- value, index, tmp, size, 500);
+ value, index, tmp, size, USB_CTRL_SET_TIMEOUT);

kfree(tmp);

@@ -9494,7 +9494,8 @@ static u8 __rtl_get_hw_ver(struct usb_device *udev)

ret = usb_control_msg(udev, usb_rcvctrlpipe(udev, 0),
RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
- PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp), 500);
+ PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp),
+ USB_CTRL_GET_TIMEOUT);
if (ret > 0)
ocp_data = (__le32_to_cpu(*tmp) >> 16) & VERSION_MASK;

--
2.42.0.655.g421f12c284-goog

2023-10-12 19:31:33

by Doug Anderson

Subject: [PATCH v3 4/5] r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE

Whenever the RTL8152_UNPLUG flag is set, it just tells the driver that all
accesses will fail and we should just immediately bail. A future patch
will use this same concept at a time when the driver hasn't actually
been unplugged but is about to be reset. Rename the flag in
preparation for the future patch.

This is a no-op change and just a search and replace.

Signed-off-by: Douglas Anderson <[email protected]>
---

(no changes since v2)

Changes in v2:
- ("Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE") new for v2.

drivers/net/usb/r8152.c | 96 ++++++++++++++++++++---------------------
1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 888d3884821e..151c3c383080 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -764,7 +764,7 @@ enum rtl_register_content {

/* rtl8152 flags */
enum rtl8152_flags {
- RTL8152_UNPLUG = 0,
+ RTL8152_INACCESSIBLE = 0,
RTL8152_SET_RX_MODE,
WORK_ENABLE,
RTL8152_LINK_CHG,
@@ -1245,7 +1245,7 @@ int set_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
static void rtl_set_unplug(struct r8152 *tp)
{
if (tp->udev->state == USB_STATE_NOTATTACHED) {
- set_bit(RTL8152_UNPLUG, &tp->flags);
+ set_bit(RTL8152_INACCESSIBLE, &tp->flags);
smp_mb__after_atomic();
}
}
@@ -1256,7 +1256,7 @@ static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
u16 limit = 64;
int ret = 0;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

/* both size and indix must be 4 bytes align */
@@ -1300,7 +1300,7 @@ static int generic_ocp_write(struct r8152 *tp, u16 index, u16 byteen,
u16 byteen_start, byteen_end, byen;
u16 limit = 512;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

/* both size and indix must be 4 bytes align */
@@ -1537,7 +1537,7 @@ static int read_mii_word(struct net_device *netdev, int phy_id, int reg)
struct r8152 *tp = netdev_priv(netdev);
int ret;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

if (phy_id != R8152_PHY_ID)
@@ -1553,7 +1553,7 @@ void write_mii_word(struct net_device *netdev, int phy_id, int reg, int val)
{
struct r8152 *tp = netdev_priv(netdev);

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (phy_id != R8152_PHY_ID)
@@ -1758,7 +1758,7 @@ static void read_bulk_callback(struct urb *urb)
if (!tp)
return;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (!test_bit(WORK_ENABLE, &tp->flags))
@@ -1850,7 +1850,7 @@ static void write_bulk_callback(struct urb *urb)
if (!test_bit(WORK_ENABLE, &tp->flags))
return;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (!skb_queue_empty(&tp->tx_queue))
@@ -1871,7 +1871,7 @@ static void intr_callback(struct urb *urb)
if (!test_bit(WORK_ENABLE, &tp->flags))
return;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

switch (status) {
@@ -2615,7 +2615,7 @@ static void bottom_half(struct tasklet_struct *t)
{
struct r8152 *tp = from_tasklet(tp, t, tx_tl);

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (!test_bit(WORK_ENABLE, &tp->flags))
@@ -2658,7 +2658,7 @@ int r8152_submit_rx(struct r8152 *tp, struct rx_agg *agg, gfp_t mem_flags)
int ret;

/* The rx would be stopped, so skip submitting */
- if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags) ||
!test_bit(WORK_ENABLE, &tp->flags) || !netif_carrier_ok(tp->netdev))
return 0;

@@ -3058,7 +3058,7 @@ static int rtl_enable(struct r8152 *tp)

static int rtl8152_enable(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

set_tx_qlen(tp);
@@ -3145,7 +3145,7 @@ static int rtl8153_enable(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

set_tx_qlen(tp);
@@ -3177,7 +3177,7 @@ static void rtl_disable(struct r8152 *tp)
u32 ocp_data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
return;
}
@@ -3631,7 +3631,7 @@ static u16 r8153_phy_status(struct r8152 *tp, u16 desired)
}

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
break;
}

@@ -3663,7 +3663,7 @@ static void r8153b_ups_en(struct r8152 *tp, bool enable)
int i;

for (i = 0; i < 500; i++) {
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;
if (ocp_read_word(tp, MCU_TYPE_PLA, PLA_BOOT_CTRL) &
AUTOLOAD_DONE)
@@ -3705,7 +3705,7 @@ static void r8153c_ups_en(struct r8152 *tp, bool enable)
int i;

for (i = 0; i < 500; i++) {
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;
if (ocp_read_word(tp, MCU_TYPE_PLA, PLA_BOOT_CTRL) &
AUTOLOAD_DONE)
@@ -4050,8 +4050,8 @@ static int rtl_phy_patch_request(struct r8152 *tp, bool request, bool wait)
for (i = 0; wait && i < 5000; i++) {
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
- break;
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
+ return -ENODEV;

usleep_range(1000, 2000);
ocp_data = ocp_reg_read(tp, OCP_PHY_PATCH_STAT);
@@ -6009,7 +6009,7 @@ static int rtl8156_enable(struct r8152 *tp)
u32 ocp_data;
u16 speed;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

r8156_fc_parameter(tp);
@@ -6067,7 +6067,7 @@ static int rtl8156b_enable(struct r8152 *tp)
u32 ocp_data;
u16 speed;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

set_tx_qlen(tp);
@@ -6253,7 +6253,7 @@ static int rtl8152_set_speed(struct r8152 *tp, u8 autoneg, u32 speed, u8 duplex,

static void rtl8152_up(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8152_aldps_en(tp, false);
@@ -6263,7 +6263,7 @@ static void rtl8152_up(struct r8152 *tp)

static void rtl8152_down(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
return;
}
@@ -6278,7 +6278,7 @@ static void rtl8153_up(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153_u1u2en(tp, false);
@@ -6318,7 +6318,7 @@ static void rtl8153_down(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
return;
}
@@ -6339,7 +6339,7 @@ static void rtl8153b_up(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_u1u2en(tp, false);
@@ -6363,7 +6363,7 @@ static void rtl8153b_down(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
return;
}
@@ -6400,7 +6400,7 @@ static void rtl8153c_up(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_u1u2en(tp, false);
@@ -6481,7 +6481,7 @@ static void rtl8156_up(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_u1u2en(tp, false);
@@ -6554,7 +6554,7 @@ static void rtl8156_down(struct r8152 *tp)
{
u32 ocp_data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
return;
}
@@ -6692,7 +6692,7 @@ static void rtl_work_func_t(struct work_struct *work)
/* If the device is unplugged or !netif_running(), the workqueue
* doesn't need to wake the device, and could return directly.
*/
- if (test_bit(RTL8152_UNPLUG, &tp->flags) || !netif_running(tp->netdev))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags) || !netif_running(tp->netdev))
return;

if (usb_autopm_get_interface(tp->intf) < 0)
@@ -6731,7 +6731,7 @@ static void rtl_hw_phy_work_func_t(struct work_struct *work)
{
struct r8152 *tp = container_of(work, struct r8152, hw_phy_work.work);

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (usb_autopm_get_interface(tp->intf) < 0)
@@ -6858,7 +6858,7 @@ static int rtl8152_close(struct net_device *netdev)
netif_stop_queue(netdev);

res = usb_autopm_get_interface(tp->intf);
- if (res < 0 || test_bit(RTL8152_UNPLUG, &tp->flags)) {
+ if (res < 0 || test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
rtl_drop_queued_tx(tp);
rtl_stop_rx(tp);
} else {
@@ -6891,7 +6891,7 @@ static void r8152b_init(struct r8152 *tp)
u32 ocp_data;
u16 data;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

data = r8152_mdio_read(tp, MII_BMCR);
@@ -6935,7 +6935,7 @@ static void r8153_init(struct r8152 *tp)
u16 data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153_u1u2en(tp, false);
@@ -6946,7 +6946,7 @@ static void r8153_init(struct r8152 *tp)
break;

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
break;
}

@@ -7075,7 +7075,7 @@ static void r8153b_init(struct r8152 *tp)
u16 data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_u1u2en(tp, false);
@@ -7086,7 +7086,7 @@ static void r8153b_init(struct r8152 *tp)
break;

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
break;
}

@@ -7157,7 +7157,7 @@ static void r8153c_init(struct r8152 *tp)
u16 data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_u1u2en(tp, false);
@@ -7177,7 +7177,7 @@ static void r8153c_init(struct r8152 *tp)
break;

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;
}

@@ -8006,7 +8006,7 @@ static void r8156_init(struct r8152 *tp)
u16 data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_ECM_OP);
@@ -8027,7 +8027,7 @@ static void r8156_init(struct r8152 *tp)
break;

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;
}

@@ -8102,7 +8102,7 @@ static void r8156b_init(struct r8152 *tp)
u16 data;
int i;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_ECM_OP);
@@ -8136,7 +8136,7 @@ static void r8156b_init(struct r8152 *tp)
break;

msleep(20);
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;
}

@@ -9165,7 +9165,7 @@ static int rtl8152_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
struct mii_ioctl_data *data = if_mii(rq);
int res;

- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return -ENODEV;

res = usb_autopm_get_interface(tp->intf);
@@ -9267,7 +9267,7 @@ static const struct net_device_ops rtl8152_netdev_ops = {

static void rtl8152_unload(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

if (tp->version != RTL_VER_01)
@@ -9276,7 +9276,7 @@ static void rtl8152_unload(struct r8152 *tp)

static void rtl8153_unload(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153_power_cut_en(tp, false);
@@ -9284,7 +9284,7 @@ static void rtl8153_unload(struct r8152 *tp)

static void rtl8153b_unload(struct r8152 *tp)
{
- if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
return;

r8153b_power_cut_en(tp, false);
--
2.42.0.655.g421f12c284-goog

2023-10-12 19:31:52

by Doug Anderson

Subject: [PATCH v3 5/5] r8152: Block future register access if register access fails

Even though the functions to read/write registers can fail, most of
the places in the r8152 driver that read/write register values don't
check error codes. The lack of error code checking is problematic in
at least two ways.

The first problem is that the r8152 driver often uses code patterns
similar to this:
x = read_register()
x = x | SOME_BIT;
write_register(x);

...with the above pattern, if the read_register() fails and returns
garbage then we'll end up trying to write modified garbage back to the
Realtek adapter. If the write_register() succeeds that's bad. Note
that as of commit f53a7ad18959 ("r8152: Set memory to all 0xFFs on
failed reg reads") the "garbage" returned by read_register() will at
least be consistent garbage, but it is still garbage.
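
As a rough illustration of the pattern above, here is a small userspace
sketch. Nothing in it is the real driver API: struct fake_adapter,
fake_read(), fake_write() and both set_bit_*() helpers are hypothetical
stand-ins. The only behavior borrowed from the driver is that a failed read
leaves the caller's buffer filled with 0xFF bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one adapter register; not the real hardware. */
struct fake_adapter {
	uint32_t reg;		/* current register contents */
	bool read_fails;	/* simulate a timed-out control transfer */
};

/* On a failed read the buffer is filled with 0xFF bytes and an error is
 * returned, similar to what the commit message describes.
 */
static int fake_read(struct fake_adapter *a, uint32_t *val)
{
	assert(a && val);

	if (a->read_fails) {
		memset(val, 0xff, sizeof(*val));
		return -1;
	}
	*val = a->reg;
	return 0;
}

/* Writes always succeed in this model: the "channel recovered" case. */
static int fake_write(struct fake_adapter *a, uint32_t val)
{
	a->reg = val;
	return 0;
}

/* Unchecked read-modify-write: the error from the read is ignored. */
static void set_bit_unchecked(struct fake_adapter *a, uint32_t bit)
{
	uint32_t x;

	fake_read(a, &x);
	x |= bit;
	fake_write(a, x);
}

/* Checked variant: skip the write when the read failed. */
static int set_bit_checked(struct fake_adapter *a, uint32_t bit)
{
	uint32_t x;
	int ret = fake_read(a, &x);

	if (ret < 0)
		return ret;
	return fake_write(a, x | bit);
}
```

In this model, set_bit_unchecked() with a failing read leaves the register
holding 0xffffffff, while set_bit_checked() leaves it untouched.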

It turns out that this problem is very serious. Writing garbage to
some of the hardware registers on the Ethernet adapter can put the
adapter in such a bad state that it needs to be power cycled (fully
unplugged and plugged in again) before it can enumerate again.

The second problem is that the r8152 driver generally has functions
that are long sequences of register writes. Assuming everything will
be OK if a random register write fails in the middle isn't a great
assumption.

One might wonder if the above two problems are real. You could ask if
we would really have a successful write after a failed read. It turns
out that the answer appears to be "yes, this can happen". In fact,
we've seen at least two distinct failure modes where this happens.

On a sc7180-trogdor Chromebook if you drop into kdb for a while and
then resume, you can see:
1. We get a "Tx timeout"
2. The "Tx timeout" queues up a USB reset.
3. In rtl8152_pre_reset() we try to reinit the hardware.
4. The first several (2-9) register accesses fail with a timeout, then
things recover.

The above test case was actually fixed by the patch ("r8152: Increase
USB control msg timeout to 5000ms as per spec") but at least shows
that we really can see successful calls after failed ones.

On a different (AMD) based Chromebook with a particular adapter, we
found that during reboot tests we'd also sometimes get a transitory
failure. In this case we saw -EPIPE being returned sometimes. Retrying
worked, but retrying is not always safe for all register accesses
since reading/writing some registers might have side effects (like
registers that clear on read).
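
To show why a blind retry can be unsafe, here is a small sketch of a
clear-on-read register. It is purely illustrative: struct cor_reg and its
helpers are hypothetical, and the failure mode (the device services the read
but the reply never reaches the host) is an assumption chosen for the
example, not something we observed on the adapter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clear-on-read status register; not a real r8152 register. */
struct cor_reg {
	uint16_t pending;	/* latched status bits */
	bool drop_next_reply;	/* simulate a transfer whose reply is lost */
};

/* Assumption: the device clears the register as soon as it services the
 * read, even if the reply never makes it back to the host.
 */
static int cor_read(struct cor_reg *r, uint16_t *val)
{
	uint16_t snapshot;

	assert(r && val);

	snapshot = r->pending;
	r->pending = 0;		/* side effect happens on the device side */
	if (r->drop_next_reply) {
		r->drop_next_reply = false;
		return -1;	/* host only sees a failed transfer */
	}
	*val = snapshot;
	return 0;
}

/* Naive retry wrapper: fine for idempotent reads, lossy for this one. */
static int cor_read_retry(struct cor_reg *r, uint16_t *val, int tries)
{
	int ret = -1;

	while (tries-- > 0) {
		ret = cor_read(r, val);
		if (ret == 0)
			break;
	}
	return ret;
}
```

Under that assumption the retry "succeeds" but returns 0, so whatever bits
were latched before the first attempt are silently lost.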

Let's fully lock out all register access if a register access fails.
When we do this, we'll try to queue up a USB reset and try to unlock
register access after the reset. This is slightly trickier than it
sounds since the r8152 driver has an optimized reset sequence that
only works reliably after probe happens. In order to handle this, we
avoid the optimized reset if probe didn't finish.

When locking out access, we'll use the existing infrastructure that
the driver was using when it detected we were unplugged. This keeps us
from getting stuck in delay loops in some parts of the driver.
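
The lockout idea can be modeled in a few lines of userspace C. This is only
a sketch of the concept, not the patch itself: struct model and its helpers
are hypothetical, the reset is just a counter, and the special cases for
pre-reset and for failures during probe are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RESETS 3	/* same value as REGISTER_ACCESS_MAX_RESETS */

/* Hypothetical userspace model of the state; not the real struct r8152. */
struct model {
	bool inaccessible;		/* stands in for RTL8152_INACCESSIBLE */
	unsigned int reset_count;	/* resets since the last good access */
	unsigned int resets_queued;	/* total resets requested */
};

/* One register access. transfer_ok says whether the simulated transfer
 * would succeed. Returns 0 on success, -1 on failure or when blocked.
 */
static int model_access(struct model *m, bool transfer_ok)
{
	assert(m);

	if (m->inaccessible)
		return -1;	/* blocked: do not touch the hardware */

	if (transfer_ok) {
		m->reset_count = 0;
		return 0;
	}

	m->inaccessible = true;	/* lock out every later access */
	if (m->reset_count < MAX_RESETS) {
		m->resets_queued++;	/* stands in for queueing a USB reset */
		m->reset_count++;
	}
	return -1;
}

/* Stands in for post-reset, where access is unlocked again. */
static void model_post_reset(struct model *m)
{
	m->inaccessible = false;
}
```

The point of the counter is that a device which keeps failing right after
every reset stops being reset after MAX_RESETS attempts, while a single
successful access clears the count.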

Signed-off-by: Douglas Anderson <[email protected]>
---
Originally when looking at this problem I thought that the obvious
solution was to "just" add better error handling to the driver. This
_sounds_ appealing, but it's a massive change and touches a
significant portion of the lines in this driver. It's also not always
obvious what the driver should be doing to handle errors.

If you feel like you need to be convinced and to see what it looked
like to add better error handling, I put up my "work in progress"
patch when I was investigating this at: https://crrev.com/c/4937290

There is still some active debate between the two approaches, though,
so it would be interesting to hear if anyone had any opinions.

Changes in v3:
- Fixed v2 changelog ending up in the commit message.
- farmework -> framework in comments.

Changes in v2:
- Reset patch no longer based on retry patch, since that was dropped.
- Reset patch should be robust even if failures happen in probe.
- Switched booleans to bits in the "flags" variable.
- Check for -ENODEV instead of "udev->state == USB_STATE_NOTATTACHED"

drivers/net/usb/r8152.c | 176 ++++++++++++++++++++++++++++++++++++----
1 file changed, 159 insertions(+), 17 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 151c3c383080..fce7c58f8142 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -773,6 +773,8 @@ enum rtl8152_flags {
SCHEDULE_TASKLET,
GREEN_ETHERNET,
RX_EPROTO,
+ IN_PRE_RESET,
+ PROBED_WITH_NO_ERRORS,
};

#define DEVICE_ID_LENOVO_USB_C_TRAVEL_HUB 0x721e
@@ -953,6 +955,8 @@ struct r8152 {
u8 version;
u8 duplex;
u8 autoneg;
+
+ unsigned int reg_access_reset_count;
};

/**
@@ -1200,6 +1204,91 @@ static unsigned int agg_buf_sz = 16384;

#define RTL_LIMITED_TSO_SIZE (size_to_mtu(agg_buf_sz) - sizeof(struct tx_desc))

+/* If register access fails then we block access and issue a reset. If this
+ * happens too many times in a row without a successful access then we stop
+ * trying to reset and just leave access blocked.
+ */
+#define REGISTER_ACCESS_MAX_RESETS 3
+
+static void rtl_set_inaccessible(struct r8152 *tp)
+{
+ set_bit(RTL8152_INACCESSIBLE, &tp->flags);
+ smp_mb__after_atomic();
+}
+
+static void rtl_set_accessible(struct r8152 *tp)
+{
+ clear_bit(RTL8152_INACCESSIBLE, &tp->flags);
+ smp_mb__after_atomic();
+}
+
+static
+int r8152_control_msg(struct r8152 *tp, unsigned int pipe, __u8 request,
+ __u8 requesttype, __u16 value, __u16 index, void *data,
+ __u16 size, const char *msg_tag)
+{
+ struct usb_device *udev = tp->udev;
+ int ret;
+
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags))
+ return -ENODEV;
+
+ ret = usb_control_msg(udev, pipe, request, requesttype,
+ value, index, data, size,
+ USB_CTRL_GET_TIMEOUT);
+
+ /* No need to issue a reset to report an error if the USB device got
+ * unplugged; just return immediately.
+ */
+ if (ret == -ENODEV)
+ return ret;
+
+ /* If the transfer was successful then we're done */
+ if (ret >= 0) {
+ tp->reg_access_reset_count = 0;
+ return ret;
+ }
+
+ dev_err(&udev->dev,
+ "Failed to %s %d bytes at %#06x/%#06x (%d)\n",
+ msg_tag, size, value, index, ret);
+
+ /* Block all future register access until we reset. Much of the code
+ * in the driver doesn't check for errors. Notably, many parts of the
+ * driver do a read/modify/write of a register value without
+ * confirming that the read succeeded. Writing back modified garbage
+ * like this can fully wedge the adapter, requiring a power cycle.
+ */
+ rtl_set_inaccessible(tp);
+
+ /* Failing to access registers in pre-reset is not surprising since we
+ * wouldn't be resetting if things were behaving normally. The register
+ * access we do in pre-reset isn't truly mandatory--we're just reusing
+ * the disable() function and trying to be nice by powering the
+ * adapter down before resetting it. Thus, if we're in pre-reset,
+ * we'll return right away and not try to queue up yet another reset.
+ * We know the post-reset is already coming.
+ *
+ * We'll also return right away if we haven't finished probe. At the
+ * end of probe we'll queue the reset just to make sure it doesn't
+ * timeout.
+ */
+ if (test_bit(IN_PRE_RESET, &tp->flags) ||
+ !test_bit(PROBED_WITH_NO_ERRORS, &tp->flags))
+ return ret;
+
+ if (tp->reg_access_reset_count < REGISTER_ACCESS_MAX_RESETS) {
+ usb_queue_reset_device(tp->intf);
+ tp->reg_access_reset_count++;
+ } else if (tp->reg_access_reset_count == REGISTER_ACCESS_MAX_RESETS) {
+ dev_err(&udev->dev,
+ "Tried to reset %d times; giving up.\n",
+ REGISTER_ACCESS_MAX_RESETS);
+ }
+
+ return ret;
+}
+
static
int get_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
{
@@ -1210,9 +1299,10 @@ int get_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
if (!tmp)
return -ENOMEM;

- ret = usb_control_msg(tp->udev, tp->pipe_ctrl_in,
- RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
- value, index, tmp, size, USB_CTRL_GET_TIMEOUT);
+ ret = r8152_control_msg(tp, tp->pipe_ctrl_in,
+ RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
+ value, index, tmp, size, "read");
+
if (ret < 0)
memset(data, 0xff, size);
else
@@ -1233,9 +1323,9 @@ int set_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
if (!tmp)
return -ENOMEM;

- ret = usb_control_msg(tp->udev, tp->pipe_ctrl_out,
- RTL8152_REQ_SET_REGS, RTL8152_REQT_WRITE,
- value, index, tmp, size, USB_CTRL_SET_TIMEOUT);
+ ret = r8152_control_msg(tp, tp->pipe_ctrl_out,
+ RTL8152_REQ_SET_REGS, RTL8152_REQT_WRITE,
+ value, index, tmp, size, "write");

kfree(tmp);

@@ -1244,10 +1334,8 @@ int set_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)

static void rtl_set_unplug(struct r8152 *tp)
{
- if (tp->udev->state == USB_STATE_NOTATTACHED) {
- set_bit(RTL8152_INACCESSIBLE, &tp->flags);
- smp_mb__after_atomic();
- }
+ if (tp->udev->state == USB_STATE_NOTATTACHED)
+ rtl_set_inaccessible(tp);
}

static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
@@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct usb_interface *intf)
if (!tp)
return 0;

+ /* We can only use the optimized reset if we made it to the end of
+ * probe without any register access fails, which sets
+ * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
+ * an error here which tells the USB framework to fully unbind/rebind
+ * our driver.
+ */
+ mutex_lock(&tp->control);
+ if (!test_bit(PROBED_WITH_NO_ERRORS, &tp->flags)) {
+ mutex_unlock(&tp->control);
+ return -EIO;
+ }
+ mutex_unlock(&tp->control);
+
netdev = tp->netdev;
if (!netif_running(netdev))
return 0;
@@ -8277,7 +8378,9 @@ static int rtl8152_pre_reset(struct usb_interface *intf)
napi_disable(&tp->napi);
if (netif_carrier_ok(netdev)) {
mutex_lock(&tp->control);
+ set_bit(IN_PRE_RESET, &tp->flags);
tp->rtl_ops.disable(tp);
+ clear_bit(IN_PRE_RESET, &tp->flags);
mutex_unlock(&tp->control);
}

@@ -8293,6 +8396,10 @@ static int rtl8152_post_reset(struct usb_interface *intf)
if (!tp)
return 0;

+ mutex_lock(&tp->control);
+ rtl_set_accessible(tp);
+ mutex_unlock(&tp->control);
+
/* reset the MAC address in case of policy change */
if (determine_ethernet_addr(tp, &sa) >= 0) {
rtnl_lock();
@@ -9494,17 +9601,30 @@ static u8 __rtl_get_hw_ver(struct usb_device *udev)
__le32 *tmp;
u8 version;
int ret;
+ int i;

tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
if (!tmp)
return 0;

- ret = usb_control_msg(udev, usb_rcvctrlpipe(udev, 0),
- RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
- PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp),
- USB_CTRL_GET_TIMEOUT);
- if (ret > 0)
- ocp_data = (__le32_to_cpu(*tmp) >> 16) & VERSION_MASK;
+ /* Retry up to 3 times in case there is a transitory error. We do this
+ * since retrying a read of the version is always safe and this
+ * function doesn't take advantage of r8152_control_msg() which would
+ * queue up a reset upon error.
+ */
+ for (i = 0; i < 3; i++) {
+ ret = usb_control_msg(udev, usb_rcvctrlpipe(udev, 0),
+ RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
+ PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp),
+ USB_CTRL_GET_TIMEOUT);
+ if (ret > 0) {
+ ocp_data = (__le32_to_cpu(*tmp) >> 16) & VERSION_MASK;
+ break;
+ }
+ }
+
+ if (i != 0 && ret > 0)
+ dev_warn(&udev->dev, "Needed %d retries to read version\n", i);

kfree(tmp);

@@ -9784,7 +9904,29 @@ static int rtl8152_probe(struct usb_interface *intf,
else
device_set_wakeup_enable(&udev->dev, false);

- netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
+ mutex_lock(&tp->control);
+ if (test_bit(RTL8152_INACCESSIBLE, &tp->flags)) {
+ /* If the device is marked inaccessible before probe even
+ * finished then one of two things happened. Either we got a
+ * USB error during probe or the user already unplugged the
+ * device.
+ *
+ * If we got a USB error during probe then we skipped doing a
+ * reset in r8152_control_msg() and deferred it to here. This
+ * is because the queued reset will give up after 1 second
+ * (see usb_lock_device_for_reset()) and we want to make sure
+ * that we queue things up right before probe finishes.
+ *
+ * If the user already unplugged the device then the USB
+ * framework will call unbind right away for us. The extra
+ * reset we queue up here will be harmless.
+ */
+ usb_queue_reset_device(tp->intf);
+ } else {
+ set_bit(PROBED_WITH_NO_ERRORS, &tp->flags);
+ netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
+ }
+ mutex_unlock(&tp->control);

return 0;

--
2.42.0.655.g421f12c284-goog

2023-10-12 19:31:56

by Doug Anderson

Subject: [PATCH v3 2/5] r8152: Check for unplug in rtl_phy_patch_request()

If the adapter is unplugged while we're looping in
rtl_phy_patch_request() we could end up looping for 10 seconds (2 ms *
5000 loops). Add code similar to what's done in other places in the
driver to check for unplug and bail.

Signed-off-by: Douglas Anderson <[email protected]>
---

(no changes since v2)

Changes in v2:
- ("Check for unplug in rtl_phy_patch_request()") new for v2.

drivers/net/usb/r8152.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 482957beae66..fff2f9e67b5f 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -4046,6 +4046,9 @@ static int rtl_phy_patch_request(struct r8152 *tp, bool request, bool wait)
for (i = 0; wait && i < 5000; i++) {
u32 ocp_data;

+ if (test_bit(RTL8152_UNPLUG, &tp->flags))
+ break;
+
usleep_range(1000, 2000);
ocp_data = ocp_reg_read(tp, OCP_PHY_PATCH_STAT);
if ((ocp_data & PATCH_READY) ^ check)
--
2.42.0.655.g421f12c284-goog
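
The effect of the added check can be sketched with a small userspace model of the polling loop. This is a simplified illustration, not the driver code: `struct fake_dev`, `fake_status_ready()` and `poll_patch_ready()` are hypothetical, the sleep is omitted, and the sketch returns an error on unplug where the driver simply breaks out of the loop.

```c
#include <stdbool.h>

#define PATCH_POLL_MAX 5000	/* same loop bound as rtl_phy_patch_request() */

/* Hypothetical stand-in for the adapter state. */
struct fake_dev {
	bool unplugged;		/* plays the role of the RTL8152_UNPLUG flag */
	int ready_after;	/* poll count at which "ready" is reported, < 0 = never */
	int polls;		/* number of status reads performed so far */
};

/* Pretend status read: counts the access and reports readiness. */
static bool fake_status_ready(struct fake_dev *d)
{
	d->polls++;
	return d->ready_after >= 0 && d->polls >= d->ready_after;
}

/* Returns 0 once the device reports ready, -1 on unplug or timeout. */
static int poll_patch_ready(struct fake_dev *d)
{
	int i;

	for (i = 0; i < PATCH_POLL_MAX; i++) {
		/* Bail before touching the device if it is already gone. */
		if (d->unplugged)
			return -1;

		/* The driver sleeps 1-2 ms here; omitted in this sketch. */
		if (fake_status_ready(d))
			return 0;
	}

	return -1;
}
```

With the flag set, the loop exits without issuing a single status read, instead of spinning through all 5000 iterations.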

2023-10-16 09:17:42

by Hayes Wang

Subject: RE: [PATCH v3 5/5] r8152: Block future register access if register access fails

Douglas Anderson <[email protected]>
> Sent: Friday, October 13, 2023 3:25 AM
[...]
> static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct usb_interface
> *intf)
> if (!tp)
> return 0;
>
> + /* We can only use the optimized reset if we made it to the end of
> + * probe without any register access fails, which sets
> + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> + * an error here which tells the USB framework to fully unbind/rebind
> + * our driver.

Would you get stuck in a loop of unbind and rebind
if the control transfers in probe() are not always successful?
I am thinking about the worst case, where at least one control transfer always fails in probe().

> + */
> + mutex_lock(&tp->control);

I don't think you need the mutex for testing the bit.

> + if (!test_bit(PROBED_WITH_NO_ERRORS, &tp->flags)) {
> + mutex_unlock(&tp->control);
> + return -EIO;
> + }
> + mutex_unlock(&tp->control);
> +
> netdev = tp->netdev;
> if (!netif_running(netdev))
> return 0;
> @@ -8277,7 +8378,9 @@ static int rtl8152_pre_reset(struct usb_interface
> *intf)
> napi_disable(&tp->napi);
> if (netif_carrier_ok(netdev)) {
> mutex_lock(&tp->control);
> + set_bit(IN_PRE_RESET, &tp->flags);
> tp->rtl_ops.disable(tp);
> + clear_bit(IN_PRE_RESET, &tp->flags);
> mutex_unlock(&tp->control);
> }
>
> @@ -8293,6 +8396,10 @@ static int rtl8152_post_reset(struct usb_interface
> *intf)
> if (!tp)
> return 0;
>
> + mutex_lock(&tp->control);

I don't think clear_bit() needs the protection of mutex.
I think you could call rtl_set_accessible() directly.

> + rtl_set_accessible(tp);
> + mutex_unlock(&tp->control);
> +
> /* reset the MAC address in case of policy change */
> if (determine_ethernet_addr(tp, &sa) >= 0) {
> rtnl_lock();

Best Regards,
Hayes

2023-10-16 16:53:59

by Doug Anderson

Subject: Re: [PATCH v3 5/5] r8152: Block future register access if register access fails

Hi,

On Mon, Oct 16, 2023 at 2:16 AM Hayes Wang <[email protected]> wrote:
>
> Douglas Anderson <[email protected]>
> > Sent: Friday, October 13, 2023 3:25 AM
> [...]
> > static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> > @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct usb_interface
> > *intf)
> > if (!tp)
> > return 0;
> >
> > + /* We can only use the optimized reset if we made it to the end of
> > + * probe without any register access fails, which sets
> > + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> > + * an error here which tells the USB framework to fully unbind/rebind
> > + * our driver.
>
> Would you stay in a loop of unbind and rebind,
> if the control transfers in the probe() are not always successful?
> I just think about the worst case that at least one control always fails in probe().

We won't! :-) One of the first things that rtl8152_probe() does is to
call rtl8152_get_version(). That goes through to
__rtl_get_hw_ver(). That function _doesn't_ queue up a reset if
there are communication problems, but it does do 3 retries of the
read. So if all 3 reads fail then we will permanently fail probe,
which I think is the correct thing to do.

I can update the comment in __rtl_get_hw_ver() to make it more obvious
that this is by design?

>
> > + */
> > + mutex_lock(&tp->control);
>
> I don't think you need the mutex for testing the bit.

Sure, I'll remove it.


> > + if (!test_bit(PROBED_WITH_NO_ERRORS, &tp->flags)) {
> > + mutex_unlock(&tp->control);
> > + return -EIO;
> > + }
> > + mutex_unlock(&tp->control);
> > +
> > netdev = tp->netdev;
> > if (!netif_running(netdev))
> > return 0;
> > @@ -8277,7 +8378,9 @@ static int rtl8152_pre_reset(struct usb_interface
> > *intf)
> > napi_disable(&tp->napi);
> > if (netif_carrier_ok(netdev)) {
> > mutex_lock(&tp->control);
> > + set_bit(IN_PRE_RESET, &tp->flags);
> > tp->rtl_ops.disable(tp);
> > + clear_bit(IN_PRE_RESET, &tp->flags);
> > mutex_unlock(&tp->control);
> > }
> >
> > @@ -8293,6 +8396,10 @@ static int rtl8152_post_reset(struct usb_interface
> > *intf)
> > if (!tp)
> > return 0;
> >
> > + mutex_lock(&tp->control);
>
> I don't think clear_bit() needs the protection of mutex.
> I think you could call rtl_set_accessible() directly.

Agreed, I'll take this out.


Unless something else comes up, I'll send a new version tomorrow with
the above small changes.

-Doug

2023-10-17 13:08:54

by Hayes Wang

Subject: RE: [PATCH v3 5/5] r8152: Block future register access if register access fails

Doug Anderson <[email protected]>
> Sent: Tuesday, October 17, 2023 12:47 AM
[...]
> > > static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> > > @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct
> usb_interface
> > > *intf)
> > > if (!tp)
> > > return 0;
> > >
> > > + /* We can only use the optimized reset if we made it to the end of
> > > + * probe without any register access fails, which sets
> > > + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> > > + * an error here which tells the USB framework to fully unbind/rebind
> > > + * our driver.
> >
> > Would you stay in a loop of unbind and rebind,
> > if the control transfers in the probe() are not always successful?
> > I just think about the worst case that at least one control always fails in probe().
>
> We won't! :-) One of the first things that rtl8152_probe() does is to
> call rtl8152_get_version(). That goes through to
> rtl8152_get_version(). That function _doesn't_ queue up a reset if
> there are communication problems, but it does do 3 retries of the
> read. So if all 3 reads fail then we will permanently fail probe,
> which I think is the correct thing to do.

The probe() contains control transfers in
1. rtl8152_get_version()
2. tp->rtl_ops.init()

If one of the 3 control transfers in 1) is successful AND
any control transfer in 2) fails,
you would queue a usb reset which would unbind/rebind the driver.
Then, the loop starts.
The loop would be broken, if and only if
a) all control transfers in 1) fail, OR
b) all control transfers in 2) succeed.

That is, the loop is broken only when the failure rate of the control transfers is high enough or low enough.
Otherwise, you would queue a USB reset again and again.
For example, if the failure rate of the control transfers is between 10% and 60%,
I think there is a high probability that the loop keeps going.
Are we sure this would never happen?
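
The two conditions can be put into a rough model. This is only a sketch under assumptions that are not established in this thread: every control transfer fails independently with the same probability `p`, the version read gets 3 tries, and `init_transfers` is a free parameter for the number of transfers in `tp->rtl_ops.init()`. The helper names `powi()` and `p_requeue()` are hypothetical.

```c
/* x^n for a small non-negative integer n, by repeated multiplication. */
static double powi(double x, int n)
{
	double r = 1.0;

	while (n-- > 0)
		r *= x;

	return r;
}

/*
 * Probability that a single probe ends by queuing another reset:
 * at least one of the 3 version reads succeeds AND at least one of
 * the init transfers fails.
 */
static double p_requeue(double p, int init_transfers)
{
	double version_ok = 1.0 - powi(p, 3);
	double init_fails = 1.0 - powi(1.0 - p, init_transfers);

	return version_ok * init_fails;
}
```

In this model the value is 0 at both extremes (`p` = 0 and `p` = 1) and close to 1 for a wide range in between once `init_transfers` is large, which is the "high enough or low enough" behaviour described above.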

Best Regards,
Hayes


2023-10-17 14:17:39

by Doug Anderson

Subject: Re: [PATCH v3 5/5] r8152: Block future register access if register access fails

Hi,

On Tue, Oct 17, 2023 at 6:07 AM Hayes Wang <[email protected]> wrote:
>
> Doug Anderson <[email protected]>
> > Sent: Tuesday, October 17, 2023 12:47 AM
> [...
> > > > static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> > > > @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct
> > usb_interface
> > > > *intf)
> > > > if (!tp)
> > > > return 0;
> > > >
> > > > + /* We can only use the optimized reset if we made it to the end of
> > > > + * probe without any register access fails, which sets
> > > > + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> > > > + * an error here which tells the USB framework to fully unbind/rebind
> > > > + * our driver.
> > >
> > > Would you stay in a loop of unbind and rebind,
> > > if the control transfers in the probe() are not always successful?
> > > I just think about the worst case that at least one control always fails in probe().
> >
> > We won't! :-) One of the first things that rtl8152_probe() does is to
> > call rtl8152_get_version(). That goes through to
> > rtl8152_get_version(). That function _doesn't_ queue up a reset if
> > there are communication problems, but it does do 3 retries of the
> > read. So if all 3 reads fail then we will permanently fail probe,
> > which I think is the correct thing to do.
>
> The probe() contains control transfers in
> 1. rtl8152_get_version()
> 2. tp->rtl_ops.init()
>
> If one of the 3 control transfers in 1) is successful AND
> any control transfer in 2) fails,
> you would queue a usb reset which would unbind/rebind the driver.
> Then, the loop starts.
> The loop would be broken, if and only if
> a) all control transfers in 1) fail, OR
> b) all control transfers in 2) succeed.
>
> That is, the loop would be broken when the fail rate of the control transfer is high or low enough.
> Otherwise, you would queue a usb reset again and again.
> For example, if the fail rate of the control transfer is 10% ~ 60%,
> I think you have high probability to keep the loop continually.
> Would it never happen?

Actually, even with a failure rate of 10% I don't think you'll end up
with a fully continuous loop, right? All you need is to get 3 failures
in a row in rtl8152_get_version() to get out of the loop. So with a
10% failure rate you'd unbind/bind 1000 times (on average) and then
(finally) give up. With a 50% failure rate I think you'd only
unbind/bind 8 times on average, right? Of course, I guess 1000 loops
is pretty close to infinite.
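
For reference, the arithmetic behind those two numbers is the mean of a geometric distribution. A minimal sketch, assuming each transfer fails independently with probability `p`, that a probe gives up only when all version reads fail, and ignoring the other way out of the loop (a probe where every transfer succeeds); `expected_probes_until_give_up()` is a hypothetical helper:

```c
/*
 * Expected number of probes until one of them sees `tries` failed
 * version reads in a row. The chance of that in one probe is p^tries,
 * so the mean of the geometric distribution is 1 / p^tries.
 * p must be greater than 0.
 */
static double expected_probes_until_give_up(double p, int tries)
{
	double all_fail = 1.0;
	int i;

	for (i = 0; i < tries; i++)
		all_fail *= p;

	return 1.0 / all_fail;
}
```

With `tries` = 3 this gives 8 for `p` = 0.5 and about 1000 for `p` = 0.1.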

In any case, we haven't actually seen hardware that fails like this.
We've seen failure rates that are much much lower and we can imagine
failure rates that are 100% if we've got really broken hardware. Do
you think cases where failure rates are middle-of-the-road are likely?

I would also say that nothing we can do can perfectly handle faulty
hardware. If we're imagining theoretical hardware, we could imagine
theoretical hardware that de-enumerated itself and re-enumerated
itself every half second because the firmware on the device crashed or
some regulator kept dropping. This faulty hardware would also cause an
infinite loop of de-enumeration and re-enumeration, right?

Presumably if we get into either case, the user will realize that the
hardware isn't working and will unplug it from the system. While the
system is doing the loop of trying to enumerate the hardware, it will
be taking up a bunch of extra CPU cycles but (I believe) it won't be
fully locked up or anything. The machine will still function and be
able to do non-Ethernet activities, right? I would say that the worst
thing about this state would be that it would stress corner cases in
the reset of the USB subsystem, possibly tickling bugs.

So I guess I would summarize all the above as:

If hardware is broken in just the right way then this patch could
cause a nearly infinite unbinding/rebinding of the r8152 driver.
However:

1. It doesn't seem terribly likely for hardware to be broken in just this way.

2. We haven't seen hardware broken in just this way.

3. Hardware broken in a slightly different way could cause infinite
unbinding/rebinding even without this patch.

4. Infinite unbinding/rebinding of a USB adapter isn't great, but not
the absolute worst thing.


That all being said, if we wanted to address this we could try two
different ways:

a) We could add a global in the r8152 driver and limit the number of
times we reset. This gets a little ugly because if we have multiple
r8152 adapters plugged in then the same global would be used for both,
but maybe it's OK?

b) We could improve the USB core to somehow prevent usb_reset_device()
from running too much on a given device?
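
To make option (a) concrete, here is a rough userspace sketch of a shared reset budget. Everything in it is hypothetical: the limit of 5, the names `may_queue_reset()` and `note_probe_success()`, and the choice to refill the budget after a clean probe are assumptions for illustration, not something proposed above.

```c
#include <stdbool.h>

#define MAX_PROBE_RESETS 5	/* hypothetical limit */

/* One counter shared by every adapter, as option (a) describes. */
static int probe_resets_queued;

/* Returns true if another reset may be queued, and accounts for it. */
static bool may_queue_reset(void)
{
	if (probe_resets_queued >= MAX_PROBE_RESETS)
		return false;

	probe_resets_queued++;
	return true;
}

/* Assumption: a probe that finishes without errors refills the budget. */
static void note_probe_success(void)
{
	probe_resets_queued = 0;
}
```

The sketch also shows the ugliness mentioned in (a): a second, healthy adapter shares the same counter as the broken one.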


...though I would re-emphasize that I don't think this is something we
need to address now. If later we actually see a problem we can always
address it then.


-Doug

2023-10-17 18:38:12

by Doug Anderson

Subject: Re: [PATCH v3 5/5] r8152: Block future register access if register access fails

Hi,

On Tue, Oct 17, 2023 at 7:17 AM Doug Anderson <[email protected]> wrote:
>
> Hi,
>
> On Tue, Oct 17, 2023 at 6:07 AM Hayes Wang <[email protected]> wrote:
> >
> > Doug Anderson <[email protected]>
> > > Sent: Tuesday, October 17, 2023 12:47 AM
> > [...
> > > > > static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> > > > > @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct
> > > usb_interface
> > > > > *intf)
> > > > > if (!tp)
> > > > > return 0;
> > > > >
> > > > > + /* We can only use the optimized reset if we made it to the end of
> > > > > + * probe without any register access fails, which sets
> > > > > + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> > > > > + * an error here which tells the USB framework to fully unbind/rebind
> > > > > + * our driver.
> > > >
> > > > Would you stay in a loop of unbind and rebind,
> > > > if the control transfers in the probe() are not always successful?
> > > > I just think about the worst case that at least one control always fails in probe().
> > >
> > > We won't! :-) One of the first things that rtl8152_probe() does is to
> > > call rtl8152_get_version(). That goes through to
> > > rtl8152_get_version(). That function _doesn't_ queue up a reset if
> > > there are communication problems, but it does do 3 retries of the
> > > read. So if all 3 reads fail then we will permanently fail probe,
> > > which I think is the correct thing to do.
> >
> > The probe() contains control transfers in
> > 1. rtl8152_get_version()
> > 2. tp->rtl_ops.init()
> >
> > If one of the 3 control transfers in 1) is successful AND
> > any control transfer in 2) fails,
> > you would queue a usb reset which would unbind/rebind the driver.
> > Then, the loop starts.
> > The loop would be broken, if and only if
> > a) all control transfers in 1) fail, OR
> > b) all control transfers in 2) succeed.
> >
> > That is, the loop would be broken when the fail rate of the control transfer is high or low enough.
> > Otherwise, you would queue a usb reset again and again.
> > For example, if the fail rate of the control transfer is 10% ~ 60%,
> > I think you have high probability to keep the loop continually.
> > Would it never happen?
>
> Actually, even with a failure rate of 10% I don't think you'll end up
> with a fully continuous loop, right? All you need is to get 3 failures
> in a row in rtl8152_get_version() to get out of the loop. So with a
> 10% failure rate you'd unbind/bind 1000 times (on average) and then
> (finally) give up. With a 50% failure rate I think you'd only
> unbind/bind 8 times on average, right? Of course, I guess 1000 loops
> is pretty close to infinite.
>
> In any case, we haven't actually seen hardware that fails like this.
> We've seen failure rates that are much much lower and we can imagine
> failure rates that are 100% if we're got really broken hardware. Do
> you think cases where failure rates are middle-of-the-road are likely?
>
> I would also say that nothing we can do can perfectly handle faulty
> hardware. If we're imagining theoretical hardware, we could imagine
> theoretical hardware that de-enumerated itself and re-enumerated
> itself every half second because the firmware on the device crashed or
> some regulator kept dropping. This faulty hardware would also cause an
> infinite loop of de-enumeration and re-enumeration, right?
>
> Presumably if we get into either case, the user will realize that the
> hardware isn't working and will unplug it from the system. While the
> system is doing the loop of trying to enumerate the hardware, it will
> be taking up a bunch of extra CPU cycles but (I believe) it won't be
> fully locked up or anything. The machine will still function and be
> able to do non-Ethernet activities, right? I would say that the worst
> thing about this state would be that it would stress corner cases in
> the reset of the USB subsystem, possibly ticking bugs.
>
> So I guess I would summarize all the above as:
>
> If hardware is broken in just the right way then this patch could
> cause a nearly infinite unbinding/rebinding of the r8152 driver.
> However:
>
> 1. It doesn't seem terribly likely for hardware to be broken in just this way.
>
> 2. We haven't seen hardware broken in just this way.
>
> 3. Hardware broken in a slightly different way could cause infinite
> unbinding/rebinding even without this patch.
>
> 4. Infinite unbinding/rebinding of a USB adapter isn't great, but not
> the absolute worst thing.
>
>
> That all being said, if we wanted to address this we could try two
> different ways:
>
> a) We could add a global in the r8152 driver and limit the number of
> times we reset. This gets a little ugly because if we have multiple
> r8152 adapters plugged in then the same global would be used for both,
> but maybe it's OK?
>
> b) We could improve the USB core to somehow prevent usb_reset_device()
> from running too much on a given device?
>
>
> ...though I would re-emphasize that I don't think this is something we
> need to address now. If later we actually see a problem we can always
> address it then.

One other idea occurred to me that we could do, if we cared to solve
this hypothetical failure case. We could change the code to always
read the version 4 times on every probe. If one of the transfers fails
then we could consider that OK. If 2 or more transfers fail then we
could consider that to be an error. You still might get a _few_
unbind/bind in this hypothetical failure mode, but I think it would
catch the problem more quickly.
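
A rough sketch of that idea, with hypothetical names throughout (`version_reads_healthy()`, `try_read_fn`, `struct pattern` and `pattern_read()` exist only for this illustration): always do all 4 reads, count the failures, and accept at most one.

```c
#include <stdbool.h>

#define VERSION_READS 4		/* reads done on every probe */
#define MAX_TOLERATED_FAILS 1	/* 2 or more failures is an error */

/* Hypothetical read attempt: returns true if the transfer succeeded. */
typedef bool (*try_read_fn)(void *ctx);

/* Returns true if the link looks healthy enough to keep probing. */
static bool version_reads_healthy(try_read_fn fn, void *ctx)
{
	int fails = 0;
	int i;

	/* No early exit: every read is attempted so failures can be counted. */
	for (i = 0; i < VERSION_READS; i++) {
		if (!fn(ctx))
			fails++;
	}

	return fails <= MAX_TOLERATED_FAILS;
}

/* Test double: bit i of fail_mask set means read number i fails. */
struct pattern {
	unsigned int fail_mask;
	int next;
};

static bool pattern_read(void *ctx)
{
	struct pattern *p = ctx;
	bool ok = !(p->fail_mask & (1u << p->next));

	p->next++;
	return ok;
}
```

Unlike a retry loop, this deliberately keeps reading after a success, since the point is to sample the failure rate rather than to get one good value.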

My probability theory is rusty and I'm sure there's a better way, but
I think we can just add up all the cases. Assuming a 10% failure rate and
a 90% success rate for any transfer:

# Chance of 2 failures:
.10 * .10 * .90 * .90 +
.10 * .90 * .10 * .90 +
.10 * .90 * .90 * .10 +
.90 * .10 * .10 * .90 +
.90 * .10 * .90 * .10 +
.90 * .90 * .10 * .10

# Chance of 3 failures:
.10 * .10 * .10 * .90 +
.10 * .10 * .90 * .10 +
.10 * .90 * .10 * .10 +
.90 * .10 * .10 * .10

# Chance of 4 failures:
.10 * .10 * .10 * .10

If I add that up I get about a 5.2% chance of 2 or more failures in 4
reads. That means if we got into an unbind/bind cycle we'd get out of
it (on average) in ~19 probes because we'd see enough failures. We
could likely reduce this further by reading the version 5 or 6 times.
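
The same sum can be written as a binomial tail, which makes it easy to try 5 or 6 reads. A minimal sketch, assuming independent transfers with a fixed failure probability; `choose()`, `powi()` and `p_at_least()` are hypothetical helpers:

```c
/* Binomial coefficient C(n, k) for small n. */
static double choose(int n, int k)
{
	double r = 1.0;
	int i;

	for (i = 1; i <= k; i++)
		r = r * (n - k + i) / i;

	return r;
}

/* x^n for a small non-negative integer n. */
static double powi(double x, int n)
{
	double r = 1.0;

	while (n-- > 0)
		r *= x;

	return r;
}

/* Probability of at least k_min failures in n reads, failure rate p. */
static double p_at_least(int n, int k_min, double p)
{
	double sum = 0.0;
	int k;

	for (k = k_min; k <= n; k++)
		sum += choose(n, k) * powi(p, k) * powi(1.0 - p, n - k);

	return sum;
}
```

Counting all C(4,2) = 6 orderings of two failures, `p_at_least(4, 2, 0.10)` comes to roughly 0.052, and `1.0 / p_at_least(...)` is the average number of probes before giving up. Raising `n` to 5 or 6 gives roughly 0.081 and 0.114.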

I will note that my measurements showed that a normal probe is ~200
transfers and also includes a bunch of delays, so reading the version
a few times wouldn't be a huge deal.


In any case, I'm still of the opinion that we don't need to handle this.

-Doug

2023-10-18 06:06:36

by Grant Grundler

Subject: Re: [PATCH v3 5/5] r8152: Block future register access if register access fails

On Tue, Oct 17, 2023 at 11:46 AM Doug Anderson <[email protected]> wrote:
>
> Hi,
>
> On Tue, Oct 17, 2023 at 7:17 AM Doug Anderson <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, Oct 17, 2023 at 6:07 AM Hayes Wang <[email protected]> wrote:
> > >
> > > Doug Anderson <[email protected]>
> > > > Sent: Tuesday, October 17, 2023 12:47 AM
> > > [...
> > > > > > static int generic_ocp_read(struct r8152 *tp, u16 index, u16 size,
> > > > > > @@ -8265,6 +8353,19 @@ static int rtl8152_pre_reset(struct
> > > > usb_interface
> > > > > > *intf)
> > > > > > if (!tp)
> > > > > > return 0;
> > > > > >
> > > > > > + /* We can only use the optimized reset if we made it to the end of
> > > > > > + * probe without any register access fails, which sets
> > > > > > + * `PROBED_WITH_NO_ERRORS` to true. If we didn't have that then return
> > > > > > + * an error here which tells the USB framework to fully unbind/rebind
> > > > > > + * our driver.
> > > > >
> > > > > Would you stay in a loop of unbind and rebind,
> > > > > if the control transfers in the probe() are not always successful?
> > > > > I just think about the worst case that at least one control always fails in probe().
> > > >
> > > > We won't! :-) One of the first things that rtl8152_probe() does is to
> > > > call rtl8152_get_version(). That goes through to
> > > > rtl8152_get_version(). That function _doesn't_ queue up a reset if
> > > > there are communication problems, but it does do 3 retries of the
> > > > read. So if all 3 reads fail then we will permanently fail probe,
> > > > which I think is the correct thing to do.
> > >
> > > The probe() contains control transfers in
> > > 1. rtl8152_get_version()
> > > 2. tp->rtl_ops.init()
> > >
> > > If one of the 3 control transfers in 1) is successful AND
> > > any control transfer in 2) fails,
> > > you would queue a usb reset which would unbind/rebind the driver.
> > > Then, the loop starts.
> > > The loop would be broken, if and only if
> > > a) all control transfers in 1) fail, OR
> > > b) all control transfers in 2) succeed.
> > >
> > > That is, the loop would be broken when the fail rate of the control transfer is high or low enough.
> > > Otherwise, you would queue a usb reset again and again.
> > > For example, if the fail rate of the control transfer is 10% ~ 60%,
> > > I think you have high probability to keep the loop continually.
> > > Would it never happen?
> >
> > Actually, even with a failure rate of 10% I don't think you'll end up
> > with a fully continuous loop, right? All you need is to get 3 failures
> > in a row in rtl8152_get_version() to get out of the loop. So with a
> > 10% failure rate you'd unbind/bind 1000 times (on average) and then
> > (finally) give up. With a 50% failure rate I think you'd only
> > unbind/bind 8 times on average, right? Of course, I guess 1000 loops
> > is pretty close to infinite.
> >
> > In any case, we haven't actually seen hardware that fails like this.
> > We've seen failure rates that are much much lower and we can imagine
> > failure rates that are 100% if we're got really broken hardware. Do
> > you think cases where failure rates are middle-of-the-road are likely?
> >
> > I would also say that nothing we can do can perfectly handle faulty
> > hardware. If we're imagining theoretical hardware, we could imagine
> > theoretical hardware that de-enumerated itself and re-enumerated
> > itself every half second because the firmware on the device crashed or
> > some regulator kept dropping. This faulty hardware would also cause an
> > infinite loop of de-enumeration and re-enumeration, right?
> >
> > Presumably if we get into either case, the user will realize that the
> > hardware isn't working and will unplug it from the system. While the
> > system is doing the loop of trying to enumerate the hardware, it will
> > be taking up a bunch of extra CPU cycles but (I believe) it won't be
> > fully locked up or anything. The machine will still function and be
> > able to do non-Ethernet activities, right? I would say that the worst
> > thing about this state would be that it would stress corner cases in
> > the reset of the USB subsystem, possibly ticking bugs.
> >
> > So I guess I would summarize all the above as:
> >
> > If hardware is broken in just the right way then this patch could
> > cause a nearly infinite unbinding/rebinding of the r8152 driver.
> > However:
> >
> > 1. It doesn't seem terribly likely for hardware to be broken in just this way.
> >
> > 2. We haven't seen hardware broken in just this way.
> >
> > 3. Hardware broken in a slightly different way could cause infinite
> > unbinding/rebinding even without this patch.
> >
> > 4. Infinite unbinding/rebinding of a USB adapter isn't great, but not
> > the absolute worst thing.
> >
> >
> > That all being said, if we wanted to address this we could try two
> > different ways:
> >
> > a) We could add a global in the r8152 driver and limit the number of
> > times we reset. This gets a little ugly because if we have multiple
> > r8152 adapters plugged in then the same global would be used for both,
> > but maybe it's OK?
> >
> > b) We could improve the USB core to somehow prevent usb_reset_device()
> > from running too much on a given device?
> >
> >
> > ...though I would re-emphasize that I don't think this is something we
> > need to address now. If later we actually see a problem we can always
> > address it then.
>
> One other idea occurred to me that we could do, if we cared to solve
> this hypothetical failure case. We could change the code to always
> read the version 4 times on every probe. If one of the transfers fails
> then we could consider that OK. If 2 or more transfers fails then we
> could consider that to be an error. You still might get a _few_
> unbind/bind in this hypothetical failure mode, but I think it would
> catch the problem more quickly.
>
> My probability theory is rusty and I'm sure there's a better way, but
> I think we can just add up all the cases. Assuming a 10% failures and
> 90% success of any transfer:
>
> # Chance of 2 failures:
> .10 * .10 * .90 * .90 +
> .10 * .90 * .10 * .90 +
> .10 * .90 * .90 * .10 +
> .90 * .10 * .90 * .10 +
> .90 * .90 * .10 * .10
>
> # Chance of 3 failures:
> .10 * .10 * .10 * .90 +
> .10 * .10 * .90 * .10 +
> .10 * .90 * .10 * .10 +
> .90 * .10 * .10 * .10
>
> # Chance of 4 failures:
> .10 * .10 * .10 * .10
>
> If I add that up I get about a 4.4% chance of 2 or more failures in 4
> reads. That means if we got into an unbind/bind cycle we'd get out of
> it (on average) in ~23 probes because we'd see enough failures. We
> could likely reduce this further by reading the version 5 or 6 times.
>
> I will note that my measurements showed that a normal probe is ~200
> transfers and also includes a bunch of delays, so reading the version
> a few times wouldn't be a huge deal.
>
>
> In any case, I'm still of the opinion that we don't need to handle this.

Hayes,
As Doug points out the probability is really low of this happening for
an event that is already rare. Doug's patch is a very good step in the
right direction (driver robustness) and I think has been tested by
Chromium OS team enough that it is safe to apply to the upstream tree.
I'm a big fan of taking small steps where we can. We can further
improve on this in the future as needed.

Please add:
Reviewed-by: Grant Grundler <[email protected]>

cheers,
grant

>
> -Doug

2023-10-18 11:41:51

by Hayes Wang

Subject: RE: [PATCH v3 5/5] r8152: Block future register access if register access fails

Doug Anderson <[email protected]>
> Sent: Tuesday, October 17, 2023 10:17 PM
[...]
> > That is, the loop would be broken when the fail rate of the control transfer is high or low enough.
> > Otherwise, you would queue a usb reset again and again.
> > For example, if the fail rate of the control transfer is 10% ~ 60%,
> > I think you have high probability to keep the loop continually.
> > Would it never happen?
>
> Actually, even with a failure rate of 10% I don't think you'll end up
> with a fully continuous loop, right? All you need is to get 3 failures
> in a row in rtl8152_get_version() to get out of the loop. So with a
> 10% failure rate you'd unbind/bind 1000 times (on average) and then
> (finally) give up. With a 50% failure rate I think you'd only
> unbind/bind 8 times on average, right? Of course, I guess 1000 loops
> is pretty close to infinite.
>
> In any case, we haven't actually seen hardware that fails like this.
> We've seen failure rates that are much much lower and we can imagine
> failure rates that are 100% if we're got really broken hardware. Do
> you think cases where failure rates are middle-of-the-road are likely?

That is my question, too.
I don't know whether anything would cause this situation, either.
It is beyond my knowledge.
I am waiting for expert answers, too.

Many things can cause a control transfer to fail.
I don't have all of the real cases to analyze.
Therefore, all I can do is assume different situations.
You could say my hypotheses are unreasonable.
However, I have to tell you what I am worried about.

> I would also say that nothing we can do can perfectly handle faulty
> hardware. If we're imagining theoretical hardware, we could imagine
> theoretical hardware that de-enumerated itself and re-enumerated
> itself every half second because the firmware on the device crashed or
> some regulator kept dropping. This faulty hardware would also cause an
> infinite loop of de-enumeration and re-enumeration, right?
>
> Presumably if we get into either case, the user will realize that the
> hardware isn't working and will unplug it from the system. While the

Some of our devices are onboard. That is, they can't be unplugged.
That is why I have to consider a lot of situations.

> system is doing the loop of trying to enumerate the hardware, it will
> be taking up a bunch of extra CPU cycles but (I believe) it won't be
> fully locked up or anything. The machine will still function and be
> able to do non-Ethernet activities, right? I would say that the worst
> thing about this state would be that it would stress corner cases in
> the reset of the USB subsystem, possibly ticking bugs.
>
> So I guess I would summarize all the above as:
>
> If hardware is broken in just the right way then this patch could
> cause a nearly infinite unbinding/rebinding of the r8152 driver.
> However:
>
> 1. It doesn't seem terribly likely for hardware to be broken in just this way.
>
> 2. We haven't seen hardware broken in just this way.
>
> 3. Hardware broken in a slightly different way could cause infinite
> unbinding/rebinding even without this patch.
>
> 4. Infinite unbinding/rebinding of a USB adapter isn't great, but not
> the absolute worst thing.

It is fine if everyone agrees with these points.

Best Regards,
Hayes

2023-10-18 12:01:54

by Hayes Wang

Subject: RE: [PATCH v3 5/5] r8152: Block future register access if register access fails

Grant Grundler <[email protected]>
> Sent: Wednesday, October 18, 2023 2:06 PM
[...]
> Hayes,
> As Doug points out the probability is really low of this happening for
> an event that is already rare. Doug's patch is a very good step in the
> right direction (driver robustness) and I think has been tested by
> Chromium OS team enough that it is safe to apply to the upstream tree.
> I'm a big fan of taking small steps where we can. We can further
> improve on this in the future as needed.

I am not rejecting the patch, and I don't have the right to reject or apply it.
I just don't want the patch to cause trouble for others, and I need experts
to check my views, too.

I think someone else will decide whether or not the patch can be applied.

Best Regards,
Hayes

> Please add:
> Reviewed-by: Grant Grundler <[email protected]>
>
> cheers,
> grant
>
> >
> > -Doug

2023-10-19 15:43:25

by Doug Anderson

Subject: Re: [PATCH v3 5/5] r8152: Block future register access if register access fails

Hi,

On Wed, Oct 18, 2023 at 4:41 AM Hayes Wang <[email protected]> wrote:
>
> > In any case, we haven't actually seen hardware that fails like this.
> > We've seen failure rates that are much much lower and we can imagine
> > failure rates that are 100% if we're got really broken hardware. Do
> > you think cases where failure rates are middle-of-the-road are likely?
>
> That is my question, too.
> I don't know if something would cause the situation, either.
> This is out of my knowledge.
> I am waiting for the professional answers, too.
>
> A lot of reasons may cause the fail of the control transfer.
> I don't have all of the real situation to analyze them.
> Therefore, what I could do is to assume different situations.
> You could say my hypotheses are unreasonable.
> However, I have to tell you what I worry.

Of course! ...and I appreciate your thoughts on the topic. The more
eyes on a patch the more problems that are caught. Unless someone
disagrees, I think we at least have ideas for how this could be
addressed if it comes up. Also unless someone disagrees, I think that
if this does come up in some situation it won't be a catastrophe.

Given how things look now, I'm going to plan to send a new version of
the patch later today. Though the commit message is long, I'll add a
little more to talk about this case and point to ideas for how it
could be solved if it comes up.


> > I would also say that nothing we can do can perfectly handle faulty
> > hardware. If we're imagining theoretical hardware, we could imagine
> > theoretical hardware that de-enumerated itself and re-enumerated
> > itself every half second because the firmware on the device crashed or
> > some regulator kept dropping. This faulty hardware would also cause an
> > infinite loop of de-enumeration and re-enumeration, right?
> >
> > Presumably if we get into either case, the user will realize that the
> > hardware isn't working and will unplug it from the system. While the
>
> Some of our devices are onboard. That is, they couldn't be unplugged.
> That is why I have to consider a lot of situations.

Good point! I think even with onboard devices we could already have
preexisting conditions that could cause an unbind/rebind loop. This
would be a new condition, of course.


-Doug