2022-02-16 16:13:07

by Alvin Šipraga

[permalink] [raw]
Subject: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

From: Alvin Šipraga <[email protected]>

These two patches fix the issue reported by Arınç where PHY register
reads sometimes return garbage data.

MAINTAINERS: Please can you help me with the targetting of these two
patches? This bug is present ca. 5.16, when the SMI version of the
rtl8365mb driver was introduced. But now in net-next we have the MDIO
interface from Luiz, where the issue is also present. I am sending what
I think is an ideal patch series, but should I split it up and send the
SMI-related changes to net and the MDIO changes to net-next? If so, how
would I go about splitting it while preventing merge conflicts and build
errors?

For now I am sending it to net-next so that the whole thing can be
reviewed. If it's applied, I would gladly backport the fix to the stable
tree for 5.16, but I am still confused about what to do for 5.17.

Thanks for your help.


Alvin Šipraga (2):
net: dsa: realtek: allow subdrivers to externally lock regmap
net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

drivers/net/dsa/realtek/realtek-mdio.c | 46 +++++++++++++++++++++-
drivers/net/dsa/realtek/realtek-smi.c | 48 +++++++++++++++++++++--
drivers/net/dsa/realtek/realtek.h | 2 +
drivers/net/dsa/realtek/rtl8365mb.c | 54 ++++++++++++++++----------
4 files changed, 124 insertions(+), 26 deletions(-)

--
2.35.0


2022-02-16 16:58:47

by Alvin Šipraga

[permalink] [raw]
Subject: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

From: Alvin Šipraga <[email protected]>

Realtek switches in the rtl8365mb family can access the PHY registers of
the internal PHYs via the switch registers. This method is called
indirect access. At a high level, the indirect PHY register access
method involves reading and writing some special switch registers in a
particular sequence. This works for both SMI and MDIO connected
switches.

Currently the rtl8365mb driver does not take any care to serialize the
aforementioned access to the switch registers. In particular, it is
permitted for other driver code to access other switch registers while
the indirect PHY register access is ongoing. Locking is only done at the
regmap level. This, however, is a bug: concurrent register access, even
to unrelated switch registers, risks corrupting the PHY register value
read back via the indirect access method described above.

Arınç reported that the switch sometimes returns nonsense data when
reading the PHY registers. In particular, a value of 0 causes the
kernel's PHY subsystem to think that the link is down, but since most
reads return correct data, the link then flip-flops between up and down
over a period of time.

The aforementioned bug can be readily observed by:

1. Enabling ftrace events for regmap and mdio
2. Polling BSMR PHY register for a connected port;
it should always read the same (e.g. 0x79ed)
3. Wait for step 2 to give a different value

Example command for step 2:

while true; do phytool read swp2/2/0x01; done

On my i.MX8MM, the above steps will yield a bogus value for the BSMR PHY
register within a matter of seconds. The interleaved register access it
then evident in the trace log:

kworker/3:4-70 [003] ....... 1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
phytool-16816 [002] ....... 1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
kworker/3:4-70 [003] ....... 1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
phytool-16816 [002] ....... 1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
kworker/3:4-70 [003] ....... 1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
phytool-16816 [002] ....... 1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
kworker/3:4-70 [003] ....... 1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
phytool-16816 [002] ....... 1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
kworker/3:4-70 [003] ....... 1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
phytool-16816 [002] ....... 1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
kworker/3:4-70 [003] ....... 1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
phytool-16816 [002] ....... 1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
phytool-16816 [002] ....... 1927.142641: mdio_access: SMI-0 read phy:0x02 reg:0x01 val:0x0000 <- ?!
kworker/3:4-70 [003] ....... 1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
kworker/3:4-70 [003] ....... 1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
kworker/3:4-70 [003] ....... 1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
kworker/3:4-70 [003] ....... 1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
kworker/3:4-70 [003] ....... 1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
kworker/3:4-70 [003] ....... 1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6

The kworker here is polling MIB counters for stats, as evidenced by the
register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
polling is performed every 3 seconds, but is just one example of such
unsynchronized access.

Further investigation reveals the underlying problem: if we read from an
arbitrary register A and this read coincides with the indirect access
method in rtl8365mb_phy_ocp_read, then the final read from
RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
register A. The value read back can be readily poisoned by repeatedly
reading back the value of another register A via debugfs in a busy loop
via the dd utility or similar.

This issue appears to be unique to the indirect PHY register access
pattern. In particular, it does not seem to impact similar sequential
register operations such MIB counter access.

To fix this problem, one must guard against exactly the scenario seen in
the above trace. In particular, other parts of the driver using the
regmap API must not be permitted to access the switch registers until
the PHY register access is complete. Fix this by using the newly
introduced "nolock" regmap in all PHY-related functions, and by aquiring
the regmap mutex at the top level of the PHY register access callbacks.
Although no issue has been observed with PHY register _writes_, this
change also serializes the indirect access method there. This is done
purely as a matter of convenience.

Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
Reported-by: Arınç ÜNAL <[email protected]>
Reported-by: Luiz Angelo Daros de Luca <[email protected]>
Signed-off-by: Alvin Šipraga <[email protected]>
---
drivers/net/dsa/realtek/rtl8365mb.c | 54 ++++++++++++++++++-----------
1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/drivers/net/dsa/realtek/rtl8365mb.c b/drivers/net/dsa/realtek/rtl8365mb.c
index 2ed592147c20..c39d6b744597 100644
--- a/drivers/net/dsa/realtek/rtl8365mb.c
+++ b/drivers/net/dsa/realtek/rtl8365mb.c
@@ -590,7 +590,7 @@ static int rtl8365mb_phy_poll_busy(struct realtek_priv *priv)
{
u32 val;

- return regmap_read_poll_timeout(priv->map,
+ return regmap_read_poll_timeout(priv->map_nolock,
RTL8365MB_INDIRECT_ACCESS_STATUS_REG,
val, !val, 10, 100);
}
@@ -604,7 +604,7 @@ static int rtl8365mb_phy_ocp_prepare(struct realtek_priv *priv, int phy,
/* Set OCP prefix */
val = FIELD_GET(RTL8365MB_PHY_OCP_ADDR_PREFIX_MASK, ocp_addr);
ret = regmap_update_bits(
- priv->map, RTL8365MB_GPHY_OCP_MSB_0_REG,
+ priv->map_nolock, RTL8365MB_GPHY_OCP_MSB_0_REG,
RTL8365MB_GPHY_OCP_MSB_0_CFG_CPU_OCPADR_MASK,
FIELD_PREP(RTL8365MB_GPHY_OCP_MSB_0_CFG_CPU_OCPADR_MASK, val));
if (ret)
@@ -617,8 +617,8 @@ static int rtl8365mb_phy_ocp_prepare(struct realtek_priv *priv, int phy,
ocp_addr >> 1);
val |= FIELD_PREP(RTL8365MB_INDIRECT_ACCESS_ADDRESS_OCPADR_9_6_MASK,
ocp_addr >> 6);
- ret = regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_ADDRESS_REG,
- val);
+ ret = regmap_write(priv->map_nolock,
+ RTL8365MB_INDIRECT_ACCESS_ADDRESS_REG, val);
if (ret)
return ret;

@@ -631,36 +631,42 @@ static int rtl8365mb_phy_ocp_read(struct realtek_priv *priv, int phy,
u32 val;
int ret;

+ mutex_lock(&priv->map_lock);
+
ret = rtl8365mb_phy_poll_busy(priv);
if (ret)
- return ret;
+ goto out;

ret = rtl8365mb_phy_ocp_prepare(priv, phy, ocp_addr);
if (ret)
- return ret;
+ goto out;

/* Execute read operation */
val = FIELD_PREP(RTL8365MB_INDIRECT_ACCESS_CTRL_CMD_MASK,
RTL8365MB_INDIRECT_ACCESS_CTRL_CMD_VALUE) |
FIELD_PREP(RTL8365MB_INDIRECT_ACCESS_CTRL_RW_MASK,
RTL8365MB_INDIRECT_ACCESS_CTRL_RW_READ);
- ret = regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_CTRL_REG, val);
+ ret = regmap_write(priv->map_nolock, RTL8365MB_INDIRECT_ACCESS_CTRL_REG,
+ val);
if (ret)
- return ret;
+ goto out;

ret = rtl8365mb_phy_poll_busy(priv);
if (ret)
- return ret;
+ goto out;

/* Get PHY register data */
- ret = regmap_read(priv->map, RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG,
- &val);
+ ret = regmap_read(priv->map_nolock,
+ RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, &val);
if (ret)
- return ret;
+ goto out;

*data = val & 0xFFFF;

- return 0;
+out:
+ mutex_unlock(&priv->map_lock);
+
+ return ret;
}

static int rtl8365mb_phy_ocp_write(struct realtek_priv *priv, int phy,
@@ -669,32 +675,38 @@ static int rtl8365mb_phy_ocp_write(struct realtek_priv *priv, int phy,
u32 val;
int ret;

+ mutex_lock(&priv->map_lock);
+
ret = rtl8365mb_phy_poll_busy(priv);
if (ret)
- return ret;
+ goto out;

ret = rtl8365mb_phy_ocp_prepare(priv, phy, ocp_addr);
if (ret)
- return ret;
+ goto out;

/* Set PHY register data */
- ret = regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_WRITE_DATA_REG,
- data);
+ ret = regmap_write(priv->map_nolock,
+ RTL8365MB_INDIRECT_ACCESS_WRITE_DATA_REG, data);
if (ret)
- return ret;
+ goto out;

/* Execute write operation */
val = FIELD_PREP(RTL8365MB_INDIRECT_ACCESS_CTRL_CMD_MASK,
RTL8365MB_INDIRECT_ACCESS_CTRL_CMD_VALUE) |
FIELD_PREP(RTL8365MB_INDIRECT_ACCESS_CTRL_RW_MASK,
RTL8365MB_INDIRECT_ACCESS_CTRL_RW_WRITE);
- ret = regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_CTRL_REG, val);
+ ret = regmap_write(priv->map_nolock, RTL8365MB_INDIRECT_ACCESS_CTRL_REG,
+ val);
if (ret)
- return ret;
+ goto out;

ret = rtl8365mb_phy_poll_busy(priv);
if (ret)
- return ret;
+ goto out;
+
+out:
+ mutex_unlock(&priv->map_lock);

return 0;
}
--
2.35.0

2022-02-16 19:32:21

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

Luiz Angelo Daros de Luca <[email protected]> writes:

>> These two patches fix the issue reported by Arınç where PHY register
>> reads sometimes return garbage data.
>>
>> MAINTAINERS: Please can you help me with the targetting of these two
>> patches? This bug is present ca. 5.16, when the SMI version of the
>> rtl8365mb driver was introduced. But now in net-next we have the MDIO
>> interface from Luiz, where the issue is also present. I am sending what
>> I think is an ideal patch series, but should I split it up and send the
>> SMI-related changes to net and the MDIO changes to net-next? If so, how
>> would I go about splitting it while preventing merge conflicts and build
>> errors?
>>
>> For now I am sending it to net-next so that the whole thing can be
>> reviewed. If it's applied, I would gladly backport the fix to the stable
>> tree for 5.16, but I am still confused about what to do for 5.17.
>>
>> Thanks for your help.
>>
>>
>> Alvin Šipraga (2):
>> net: dsa: realtek: allow subdrivers to externally lock regmap
>> net: dsa: realtek: rtl8365mb: serialize indirect PHY register access
>>
>> drivers/net/dsa/realtek/realtek-mdio.c | 46 +++++++++++++++++++++-
>> drivers/net/dsa/realtek/realtek-smi.c | 48 +++++++++++++++++++++--
>> drivers/net/dsa/realtek/realtek.h | 2 +
>> drivers/net/dsa/realtek/rtl8365mb.c | 54 ++++++++++++++++----------
>> 4 files changed, 124 insertions(+), 26 deletions(-)
>>
>> --
>> 2.35.0
>>
>
> Thanks for the fix, Alvin.
>
> I still feel like we are trying to go around a regmap limitation
> instead of fixing it there. If we control regmap lock (we can define a
> custom lock/unlock) and create new regmap_{read,write}_nolock
> variants, we'll just need to lock the regmap, do whatever you need,
> and unlock it.

Can you show me what those regmap_{read,write}_nolock variants would
look like in your example? And what about the other regmap_ APIs we use,
like regmap_read_poll_timeout, regmap_update_bits, etc. - do you propose
to reimplement all of these?

>
> BTW, I believe that, for realtek-mdio, a regmap custom lock mechanism
> could simply use mdio lock while realtek-smi already has priv->lock.

Hmm OK. Actually I'm a bit confused about the mdio_lock: can you explain
what it's guarding against, for someone unfamiliar with MDIO? Currently
realtek-mdio's regmap has an additional lock around it (disable_locking
is 0), so with these patches applied the number of locks remains the
same.

priv->lock is a spinlock which is inappropriate here. I'm not really
sure what the point of it is, besides to handle unlocked calls to the
_noack function. It might be removable altogether but I would prefer not
to touch it for this series.

Kind regards,
Alvin

2022-02-16 19:38:55

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

Andrew Lunn <[email protected]> writes:
>> Hmm OK. Actually I'm a bit confused about the mdio_lock: can you explain
>> what it's guarding against, for someone unfamiliar with MDIO?
>
> The more normal use case for MDIO is for PHYs, not switches. There can
> be multiple PHYs on one MDIO bus. And these PHYs each have there own
> state machine in phylib. At any point in time, that state machine can
> request the driver to do something, like poll the PHY status, does it
> have link? To prevent two PHY drivers trying to use the MDIO bus at
> the same time, there is an MDIO lock. At the beginning of an MDIO
> transaction, the lock is taken. And the end of the transaction,
> reading or writing one register of a device on the bus, the lock is
> released.
>
> So the MDIO lock simply ensures there is only one user of the MDIO bus
> at one time, for a single read or write.
>
> For PHYs this is sufficient. For switches, sometimes you need
> additional protection. The granularity of an access might not be a
> single register read or a write. It could be you need to read or write
> a few registers in an atomic way. If that is the case, you need a lock
> at a higher level.

Thank you Andrew for the clear explanation.

Somewhat unrelated to this series, but are you able to explain to me the
difference between:

mutex_lock(&bus->mdio_lock);
and
mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);

While looking at other driver examples I noticed the latter form quite a
few times too.

Kind regards,
Alvin

2022-02-16 22:48:44

by Alvin Šipraga

[permalink] [raw]
Subject: [PATCH net-next 1/2] net: dsa: realtek: allow subdrivers to externally lock regmap

From: Alvin Šipraga <[email protected]>

Currently there is no way for Realtek DSA subdrivers to serialize
consecutive regmap accesses. In preparation for a bugfix relating to
indirect PHY register access - which involves a series of regmap
reads and writes - add a facility for subdrivers to serialize their
regmap access.

Specifically, a mutex is added to the driver private data structure and
the standard regmap is initialized with custom lock/unlock ops which use
this mutex. Then, a "nolock" variant of the regmap is added, which is
functionally equivalent to the existing regmap except that regmap
locking is disabled. Functions that wish to serialize a sequence of
regmap accesses may then lock the newly introduced driver-owned mutex
before using the nolock regmap.

Doing things this way means that subdriver code that doesn't care about
serialized register access - i.e. the vast majority of code - needn't
worry about synchronizing register access with an external lock: it can
just continue to use the original regmap.

Another advantage of this design is that, while regmaps with locking
disabled do not expose a debugfs interface for obvious reasons, there
still exists the original regmap which does expose this interface. This
interface remains safe to use even combined with driver codepaths that
use the nolock regmap, because said codepaths will use the same mutex
to synchronize access.

With respect to disadvantages, it can be argued that having
near-duplicate regmaps is confusing. However, the naming is rather
explicit, and examples will abound.

Finally, while we are at it, rename realtek_smi_mdio_regmap_config to
realtek_smi_regmap_config. This makes it consistent with the naming
realtek_mdio_regmap_config in realtek-mdio.c.

Signed-off-by: Alvin Šipraga <[email protected]>
---
drivers/net/dsa/realtek/realtek-mdio.c | 46 ++++++++++++++++++++++--
drivers/net/dsa/realtek/realtek-smi.c | 48 ++++++++++++++++++++++++--
drivers/net/dsa/realtek/realtek.h | 2 ++
3 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/realtek/realtek-mdio.c b/drivers/net/dsa/realtek/realtek-mdio.c
index 0308be95d00a..31e1f100e48e 100644
--- a/drivers/net/dsa/realtek/realtek-mdio.c
+++ b/drivers/net/dsa/realtek/realtek-mdio.c
@@ -98,6 +98,20 @@ static int realtek_mdio_read(void *ctx, u32 reg, u32 *val)
return ret;
}

+static void realtek_mdio_lock(void *ctx)
+{
+ struct realtek_priv *priv = ctx;
+
+ mutex_lock(&priv->map_lock);
+}
+
+static void realtek_mdio_unlock(void *ctx)
+{
+ struct realtek_priv *priv = ctx;
+
+ mutex_unlock(&priv->map_lock);
+}
+
static const struct regmap_config realtek_mdio_regmap_config = {
.reg_bits = 10, /* A4..A0 R4..R0 */
.val_bits = 16,
@@ -108,6 +122,21 @@ static const struct regmap_config realtek_mdio_regmap_config = {
.reg_read = realtek_mdio_read,
.reg_write = realtek_mdio_write,
.cache_type = REGCACHE_NONE,
+ .lock = realtek_mdio_lock,
+ .unlock = realtek_mdio_unlock,
+};
+
+static const struct regmap_config realtek_mdio_nolock_regmap_config = {
+ .reg_bits = 10, /* A4..A0 R4..R0 */
+ .val_bits = 16,
+ .reg_stride = 1,
+ /* PHY regs are at 0x8000 */
+ .max_register = 0xffff,
+ .reg_format_endian = REGMAP_ENDIAN_BIG,
+ .reg_read = realtek_mdio_read,
+ .reg_write = realtek_mdio_write,
+ .cache_type = REGCACHE_NONE,
+ .disable_locking = true,
};

static int realtek_mdio_probe(struct mdio_device *mdiodev)
@@ -115,8 +144,9 @@ static int realtek_mdio_probe(struct mdio_device *mdiodev)
struct realtek_priv *priv;
struct device *dev = &mdiodev->dev;
const struct realtek_variant *var;
- int ret;
+ struct regmap_config rc;
struct device_node *np;
+ int ret;

var = of_device_get_match_data(dev);
if (!var)
@@ -126,13 +156,25 @@ static int realtek_mdio_probe(struct mdio_device *mdiodev)
if (!priv)
return -ENOMEM;

- priv->map = devm_regmap_init(dev, NULL, priv, &realtek_mdio_regmap_config);
+ mutex_init(&priv->map_lock);
+
+ rc = realtek_mdio_regmap_config;
+ rc.lock_arg = priv;
+ priv->map = devm_regmap_init(dev, NULL, priv, &rc);
if (IS_ERR(priv->map)) {
ret = PTR_ERR(priv->map);
dev_err(dev, "regmap init failed: %d\n", ret);
return ret;
}

+ rc = realtek_mdio_nolock_regmap_config;
+ priv->map_nolock = devm_regmap_init(dev, NULL, priv, &rc);
+ if (IS_ERR(priv->map_nolock)) {
+ ret = PTR_ERR(priv->map_nolock);
+ dev_err(dev, "regmap init failed: %d\n", ret);
+ return ret;
+ }
+
priv->mdio_addr = mdiodev->addr;
priv->bus = mdiodev->bus;
priv->dev = &mdiodev->dev;
diff --git a/drivers/net/dsa/realtek/realtek-smi.c b/drivers/net/dsa/realtek/realtek-smi.c
index 8806b74bd7a8..2243d3da55b2 100644
--- a/drivers/net/dsa/realtek/realtek-smi.c
+++ b/drivers/net/dsa/realtek/realtek-smi.c
@@ -311,7 +311,21 @@ static int realtek_smi_read(void *ctx, u32 reg, u32 *val)
return realtek_smi_read_reg(priv, reg, val);
}

-static const struct regmap_config realtek_smi_mdio_regmap_config = {
+static void realtek_smi_lock(void *ctx)
+{
+ struct realtek_priv *priv = ctx;
+
+ mutex_lock(&priv->map_lock);
+}
+
+static void realtek_smi_unlock(void *ctx)
+{
+ struct realtek_priv *priv = ctx;
+
+ mutex_unlock(&priv->map_lock);
+}
+
+static const struct regmap_config realtek_smi_regmap_config = {
.reg_bits = 10, /* A4..A0 R4..R0 */
.val_bits = 16,
.reg_stride = 1,
@@ -321,6 +335,21 @@ static const struct regmap_config realtek_smi_mdio_regmap_config = {
.reg_read = realtek_smi_read,
.reg_write = realtek_smi_write,
.cache_type = REGCACHE_NONE,
+ .lock = realtek_smi_lock,
+ .unlock = realtek_smi_unlock,
+};
+
+static const struct regmap_config realtek_smi_nolock_regmap_config = {
+ .reg_bits = 10, /* A4..A0 R4..R0 */
+ .val_bits = 16,
+ .reg_stride = 1,
+ /* PHY regs are at 0x8000 */
+ .max_register = 0xffff,
+ .reg_format_endian = REGMAP_ENDIAN_BIG,
+ .reg_read = realtek_smi_read,
+ .reg_write = realtek_smi_write,
+ .cache_type = REGCACHE_NONE,
+ .disable_locking = true,
};

static int realtek_smi_mdio_read(struct mii_bus *bus, int addr, int regnum)
@@ -385,6 +414,7 @@ static int realtek_smi_probe(struct platform_device *pdev)
const struct realtek_variant *var;
struct device *dev = &pdev->dev;
struct realtek_priv *priv;
+ struct regmap_config rc;
struct device_node *np;
int ret;

@@ -395,14 +425,26 @@ static int realtek_smi_probe(struct platform_device *pdev)
if (!priv)
return -ENOMEM;
priv->chip_data = (void *)priv + sizeof(*priv);
- priv->map = devm_regmap_init(dev, NULL, priv,
- &realtek_smi_mdio_regmap_config);
+
+ mutex_init(&priv->map_lock);
+
+ rc = realtek_smi_regmap_config;
+ rc.lock_arg = priv;
+ priv->map = devm_regmap_init(dev, NULL, priv, &rc);
if (IS_ERR(priv->map)) {
ret = PTR_ERR(priv->map);
dev_err(dev, "regmap init failed: %d\n", ret);
return ret;
}

+ rc = realtek_smi_nolock_regmap_config;
+ priv->map_nolock = devm_regmap_init(dev, NULL, priv, &rc);
+ if (IS_ERR(priv->map_nolock)) {
+ ret = PTR_ERR(priv->map_nolock);
+ dev_err(dev, "regmap init failed: %d\n", ret);
+ return ret;
+ }
+
/* Link forward and backward */
priv->dev = dev;
priv->clk_delay = var->clk_delay;
diff --git a/drivers/net/dsa/realtek/realtek.h b/drivers/net/dsa/realtek/realtek.h
index e7d3e1bcf8b8..4fa7c6ba874a 100644
--- a/drivers/net/dsa/realtek/realtek.h
+++ b/drivers/net/dsa/realtek/realtek.h
@@ -52,6 +52,8 @@ struct realtek_priv {
struct gpio_desc *mdc;
struct gpio_desc *mdio;
struct regmap *map;
+ struct regmap *map_nolock;
+ struct mutex map_lock;
struct mii_bus *slave_mii_bus;
struct mii_bus *bus;
int mdio_addr;
--
2.35.0

Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

> These two patches fix the issue reported by Arınç where PHY register
> reads sometimes return garbage data.
>
> MAINTAINERS: Please can you help me with the targetting of these two
> patches? This bug is present ca. 5.16, when the SMI version of the
> rtl8365mb driver was introduced. But now in net-next we have the MDIO
> interface from Luiz, where the issue is also present. I am sending what
> I think is an ideal patch series, but should I split it up and send the
> SMI-related changes to net and the MDIO changes to net-next? If so, how
> would I go about splitting it while preventing merge conflicts and build
> errors?
>
> For now I am sending it to net-next so that the whole thing can be
> reviewed. If it's applied, I would gladly backport the fix to the stable
> tree for 5.16, but I am still confused about what to do for 5.17.
>
> Thanks for your help.
>
>
> Alvin Šipraga (2):
> net: dsa: realtek: allow subdrivers to externally lock regmap
> net: dsa: realtek: rtl8365mb: serialize indirect PHY register access
>
> drivers/net/dsa/realtek/realtek-mdio.c | 46 +++++++++++++++++++++-
> drivers/net/dsa/realtek/realtek-smi.c | 48 +++++++++++++++++++++--
> drivers/net/dsa/realtek/realtek.h | 2 +
> drivers/net/dsa/realtek/rtl8365mb.c | 54 ++++++++++++++++----------
> 4 files changed, 124 insertions(+), 26 deletions(-)
>
> --
> 2.35.0
>

Thanks for the fix, Alvin.

I still feel like we are trying to go around a regmap limitation
instead of fixing it there. If we control regmap lock (we can define a
custom lock/unlock) and create new regmap_{read,write}_nolock
variants, we'll just need to lock the regmap, do whatever you need,
and unlock it.

BTW, I believe that, for realtek-mdio, a regmap custom lock mechanism
could simply use mdio lock while realtek-smi already has priv->lock.

Regards,

2022-02-17 01:40:43

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

> Hmm OK. Actually I'm a bit confused about the mdio_lock: can you explain
> what it's guarding against, for someone unfamiliar with MDIO?

The more normal use case for MDIO is for PHYs, not switches. There can
be multiple PHYs on one MDIO bus. And these PHYs each have there own
state machine in phylib. At any point in time, that state machine can
request the driver to do something, like poll the PHY status, does it
have link? To prevent two PHY drivers trying to use the MDIO bus at
the same time, there is an MDIO lock. At the beginning of an MDIO
transaction, the lock is taken. And the end of the transaction,
reading or writing one register of a device on the bus, the lock is
released.

So the MDIO lock simply ensures there is only one user of the MDIO bus
at one time, for a single read or write.

For PHYs this is sufficient. For switches, sometimes you need
additional protection. The granularity of an access might not be a
single register read or a write. It could be you need to read or write
a few registers in an atomic way. If that is the case, you need a lock
at a higher level.

Andrew

2022-02-17 03:16:33

by Vladimir Oltean

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

On Wed, Feb 16, 2022 at 05:05:00PM +0100, Alvin Šipraga wrote:
> From: Alvin Šipraga <[email protected]>
>
> Realtek switches in the rtl8365mb family can access the PHY registers of
> the internal PHYs via the switch registers. This method is called
> indirect access. At a high level, the indirect PHY register access
> method involves reading and writing some special switch registers in a
> particular sequence. This works for both SMI and MDIO connected
> switches.
>
> Currently the rtl8365mb driver does not take any care to serialize the
> aforementioned access to the switch registers. In particular, it is
> permitted for other driver code to access other switch registers while
> the indirect PHY register access is ongoing. Locking is only done at the
> regmap level. This, however, is a bug: concurrent register access, even
> to unrelated switch registers, risks corrupting the PHY register value
> read back via the indirect access method described above.
>
> Arınç reported that the switch sometimes returns nonsense data when
> reading the PHY registers. In particular, a value of 0 causes the
> kernel's PHY subsystem to think that the link is down, but since most
> reads return correct data, the link then flip-flops between up and down
> over a period of time.
>
> The aforementioned bug can be readily observed by:
>
> 1. Enabling ftrace events for regmap and mdio
> 2. Polling BSMR PHY register for a connected port;
> it should always read the same (e.g. 0x79ed)
> 3. Wait for step 2 to give a different value
>
> Example command for step 2:
>
> while true; do phytool read swp2/2/0x01; done
>
> On my i.MX8MM, the above steps will yield a bogus value for the BSMR PHY
> register within a matter of seconds. The interleaved register access it
> then evident in the trace log:
>
> kworker/3:4-70 [003] ....... 1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
> phytool-16816 [002] ....... 1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
> kworker/3:4-70 [003] ....... 1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
> phytool-16816 [002] ....... 1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
> kworker/3:4-70 [003] ....... 1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
> phytool-16816 [002] ....... 1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
> kworker/3:4-70 [003] ....... 1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
> phytool-16816 [002] ....... 1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
> kworker/3:4-70 [003] ....... 1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
> phytool-16816 [002] ....... 1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
> kworker/3:4-70 [003] ....... 1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
> phytool-16816 [002] ....... 1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
> phytool-16816 [002] ....... 1927.142641: mdio_access: SMI-0 read phy:0x02 reg:0x01 val:0x0000 <- ?!
> kworker/3:4-70 [003] ....... 1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
> kworker/3:4-70 [003] ....... 1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
> kworker/3:4-70 [003] ....... 1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
> kworker/3:4-70 [003] ....... 1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
> kworker/3:4-70 [003] ....... 1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
> kworker/3:4-70 [003] ....... 1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
>
> The kworker here is polling MIB counters for stats, as evidenced by the
> register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
> polling is performed every 3 seconds, but is just one example of such
> unsynchronized access.
>
> Further investigation reveals the underlying problem: if we read from an
> arbitrary register A and this read coincides with the indirect access
> method in rtl8365mb_phy_ocp_read, then the final read from
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
> register A. The value read back can be readily poisoned by repeatedly
> reading back the value of another register A via debugfs in a busy loop
> via the dd utility or similar.
>
> This issue appears to be unique to the indirect PHY register access
> pattern. In particular, it does not seem to impact similar sequential
> register operations such MIB counter access.
>
> To fix this problem, one must guard against exactly the scenario seen in
> the above trace. In particular, other parts of the driver using the
> regmap API must not be permitted to access the switch registers until
> the PHY register access is complete. Fix this by using the newly
> introduced "nolock" regmap in all PHY-related functions, and by aquiring
> the regmap mutex at the top level of the PHY register access callbacks.
> Although no issue has been observed with PHY register _writes_, this
> change also serializes the indirect access method there. This is done
> purely as a matter of convenience.
>
> Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
> Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
> Reported-by: Arınç ÜNAL <[email protected]>
> Reported-by: Luiz Angelo Daros de Luca <[email protected]>
> Signed-off-by: Alvin Šipraga <[email protected]>
> ---

This implementation where the indirect PHY access blocks out every other
register read and write is only justified if you can prove that you can
stuff just about any unrelated register read or write before
RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.

rtl8365mb_mib_counter_read() doesn't seem like a particularly good
example to prove this, since it appears to be an indirect access
procedure as well. Single register reads or writes would be ideal, like
RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
Ideally you wouldn't even have a DSA or MDIO or PHY driver running.
Just a simple kernel module with access to the regmap, and try to read
something known, like the PHY ID of one of the internal PHYs, via an
open-coded function. Then add extra regmap accesses and see what
corrupts the indirect PHY access procedure.

Are Realtek aware of this and do they confirm the issue? Sounds like
erratum material to me, and a pretty severe one, at that. Alternatively,
we may simply not be understanding the hardware architecture, like for
example the fact that MIB indirect access and PHY indirect access may
share some common bus and must be sequential w.r.t. each other.

Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Hi Vladimir,

> This implementation where the indirect PHY access blocks out every other
> register read and write is only justified if you can prove that you can
> stuff just about any unrelated register read or write before
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
> will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.

I was the first one trying to fix this issue reported by Arinç with
SMP devices. At first I thought it was caused by two parallel indirect
access reads polling the interface (it was not using interrupts). With
no lock, they will eventually collide and one reads the result of the
other one. However, a simple lock over the indirect access didn't
solve the issue. Alvin tested it much further to isolate that indirect
register access is messed up by any other register read. The fails
while polling the interface status or the other test Alvin created
only manifests in a device with multiple cores and mine is single
core. I do get something similar in a single core device by reading an
unused register address but it is hard to blame Realtek when we are
doing something we were not supposed to do. Anyway, that indicates
that "reading a register" is not an atomic operation inside the switch
asic.

> rtl8365mb_mib_counter_read() doesn't seem like a particularly good
> example to prove this, since it appears to be an indirect access
> procedure as well. Single register reads or writes would be ideal, like
> RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
> Ideally you wouldn't even have a DSA or MDIO or PHY driver running.
> Just a simple kernel module with access to the regmap, and try to read
> something known, like the PHY ID of one of the internal PHYs, via an
> open-coded function. Then add extra regmap accesses and see what
> corrupts the indirect PHY access procedure.

The MIB might be just another example where the issue happens. It was
first noticed with a SMP device without interruptions configured. I
believe it will always fail with that configuration.

> Are Realtek aware of this and do they confirm the issue? Sounds like
> erratum material to me, and a pretty severe one, at that. Alternatively,
> we may simply not be understanding the hardware architecture, like for
> example the fact that MIB indirect access and PHY indirect access may
> share some common bus and must be sequential w.r.t. each other.

The realtek "API/driver" does exactly how the driver was doing. They
do have a lock/unlock placeholder, but only in the equivalent
regmap_{read,write} functions. Indirect access does not use locks at
all (in fact, there is no other mention of "lock" elsewhere), even
being obvious that it is not thread-safe. It was just with a DSA
driver that we started to exercise register access for real, specially
without interruptions. And even in that case, we could only notice
this issue in multicore devices. I believe that, if they know about
this issue, they might not be worried because it has never affected a
real device. It would be very interesting to hear from Realtek but I
do not have the contacts.

Regards,

Luiz

2022-02-17 11:36:13

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

Luiz Angelo Daros de Luca <[email protected]> writes:

>> > I still feel like we are trying to go around a regmap limitation
>> > instead of fixing it there. If we control regmap lock (we can define a
>> > custom lock/unlock) and create new regmap_{read,write}_nolock
>> > variants, we'll just need to lock the regmap, do whatever you need,
>> > and unlock it.
>>
>> Can you show me what those regmap_{read,write}_nolock variants would
>> look like in your example? And what about the other regmap_ APIs we use,
>> like regmap_read_poll_timeout, regmap_update_bits, etc. - do you propose
>> to reimplement all of these?
>
> The option of having two regmaps is a nice way to have "_nolock"
> variants for free. It is much cleaner than any solutions I imagined!
> Ayway, I don't believe the regmap API expects to have an evil
> non-locked clone. It looks like it is being abused.
>
> What regmap API misses is a way to create a "transaction". Mdio, for
> example, expects the user to lock the bus before doing a series of
> accesses while regmap api assumes a single atomic access is enough.
> However, Realtek indirect register access shows that it is not enough.
> We could reimplement a mutex for every case where two calls might
> share the same register (or indirectly affect others like we saw with
> Realtek) but I believe a shared solution would be better, even if it
> costs a couple more wrap functions.
>
> It would be even nicer if we have a regmap "manual lock" mode that
> will expose the lock/unlock functions but it will never call them by
> itself. It would work if it could check if the caller is actually the
> same thread/context that locked it. However I doubt there is a clean
> solution in a kernel code that can check if the lock was acquired by
> the same context that is calling the read.

I went through all of this while preparing the patch, so your arguments
are familiar to me ;-)

What I sent was the cleanest solution I could eventually think of. I
don't think it is foul play, but I agree it is a bit funny to have this
kind of "shadow regmap". However, the interface is quite safe, and as I
implied in the commit message, quite foolproof as well.

Basically, rather than reimplementing every regmap API that I want to
use while manually taking the lock, I just use another regmap with
locking disabled. It boils down to exactly the same thing.

>
>
>> > BTW, I believe that, for realtek-mdio, a regmap custom lock mechanism
>> > could simply use mdio lock while realtek-smi already has priv->lock.
>>
>> Hmm OK. Actually I'm a bit confused about the mdio_lock: can you explain
>> what it's guarding against, for someone unfamiliar with MDIO? Currently
>> realtek-mdio's regmap has an additional lock around it (disable_locking
>> is 0), so with these patches applied the number of locks remains the
>> same.
>
> Today we already have to redundants locks (mdio and regmap). Your
> patch is just replacing the regmap lock.

Is that so? Andrew seems to imply that you shouldn't be using the
mdio_lock like this, but only for per-register access, and then
implement your own higher level lock:

> For PHYs this is sufficient. For switches, sometimes you need
> additional protection. The granularity of an access might not be a
> single register read or a write. It could be you need to read or write
> a few registers in an atomic way. If that is the case, you need a lock
> at a higher level.

It seems to me like you should have used mdiobus_{read,write} or even
mdiobus_{read,write}_nested? Although the state of the art in other DSA
drivers seems like a mixed bag, so I don't know.

Since I do not have an MDIO switch in front of me to test with, and
since the existing MDIO code looks a little suspect, again I would
prefer to stick in my lane and just fix the problem without
refactoring.

>
> regmap_read is something like this:
>
> regmap_read
> lock regmap
> realtek_mdio_read()
> lock mdio
> ...
> unlock mdio
> unlock regmap
>
> If you are implementing a custom lock, simply use mdio lock directly.
>
> And the map_nolock you created does not mean "access without locks"
> but "you must lock it yourself before using anything here". If that
> lock is actually mdio_lock, it would be ok to remove the lock inside
> realtek_mdio_{read,write}. You just need a reference to those
> lock/unlock functions in realtek_priv.
>
>> priv->lock is a spinlock which is inappropriate here. I'm not really
>> sure what the point of it is, besides to handle unlocked calls to the
>> _noack function. It might be removable altogether but I would prefer not
>> to touch it for this series.
>
> If spinlock is inappropriate, it can be easily converted to a mutex.
> Everything else from realtek-mdio might apply.

Well, this is a bugfix series, not a refactoring. I am not adding more
locks than were here before. If I start touching old code (this spinlock
predates my engagement with this driver), I will have to answer to that
in the commit message too. If we want to do this, let's do it after the
bugfix has been reviewed and merged. It will be easier to justify as
well.

Kind regards,
Alvin

>
>> Kind regards,
>> Alvin

2022-02-17 12:07:46

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Luiz Angelo Daros de Luca <[email protected]> writes:

> Hi Vladimir,
>
>> This implementation where the indirect PHY access blocks out every other
>> register read and write is only justified if you can prove that you can
>> stuff just about any unrelated register read or write before
>> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
>> will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.
>
> I was the first one trying to fix this issue reported by Arinç with
> SMP devices. At first I thought it was caused by two parallel indirect
> access reads polling the interface (it was not using interrupts). With
> no lock, they will eventually collide and one reads the result of the
> other one. However, a simple lock over the indirect access didn't
> solve the issue. Alvin tested it much further to isolate that indirect
> register access is messed up by any other register read. The fails
> while polling the interface status or the other test Alvin created
> only manifests in a device with multiple cores and mine is single
> core. I do get something similar in a single core device by reading an
> unused register address but it is hard to blame Realtek when we are
> doing something we were not supposed to do. Anyway, that indicates
> that "reading a register" is not an atomic operation inside the switch
> asic.

I never observed any issue which suggests that switch register reads are
not atomic... I mean, they are (and always have been) protected by the
default regmap lock. So what makes you say this?

I have only seen issues related to PHY register access, please enlighten
us if there are other issues.

>
>> rtl8365mb_mib_counter_read() doesn't seem like a particularly good
>> example to prove this, since it appears to be an indirect access
>> procedure as well. Single register reads or writes would be ideal, like
>> RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
>> Ideally you wouldn't even have a DSA or MDIO or PHY driver running.
>> Just a simple kernel module with access to the regmap, and try to read
>> something known, like the PHY ID of one of the internal PHYs, via an
>> open-coded function. Then add extra regmap accesses and see what
>> corrupts the indirect PHY access procedure.
>
> The MIB might be just another example where the issue happens. It was
> first noticed with a SMP device without interruptions configured. I
> believe it will always fail with that configuration.

As I stated in the last thread, I tested MIB access and the problem did
not manifest itself there.

>
>> Are Realtek aware of this and do they confirm the issue? Sounds like
>> erratum material to me, and a pretty severe one, at that. Alternatively,
>> we may simply not be understanding the hardware architecture, like for
>> example the fact that MIB indirect access and PHY indirect access may
>> share some common bus and must be sequential w.r.t. each other.
>
> The realtek "API/driver" does exactly how the driver was doing. They
> do have a lock/unlock placeholder, but only in the equivalent
> regmap_{read,write} functions. Indirect access does not use locks at
> all (in fact, there is no other mention of "lock" elsewhere), even
> being obvious that it is not thread-safe. It was just with a DSA
> driver that we started to exercise register access for real, specially
> without interruptions. And even in that case, we could only notice
> this issue in multicore devices. I believe that, if they know about
> this issue, they might not be worried because it has never affected a
> real device. It would be very interesting to hear from Realtek but I
> do not have the contacts.

This is not true, at least with the sources I am reading. As I said in
my reply to Vladimir, the Realtek code takes a lock around each
top-level API call. Example:

rtk_api_ret_t rtk_port_phyStatus_get(...)
{
rtk_api_ret_t retVal;

if (NULL == RT_MAPPER->port_phyStatus_get)
return RT_ERR_DRIVER_NOT_FOUND;

RTK_API_LOCK();
retVal = RT_MAPPER->port_phyStatus_get(port, pLinkStatus, pSpeed, pDuplex);
RTK_API_UNLOCK();

return retVal;
}

Deep down in this port_phyStatus_get() callback, the indirect PHY
register access takes place.

Kind regards,
Alvin

2022-02-17 13:43:07

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

> Thank you Andrew for the clear explanation.
>
> Somewhat unrelated to this series, but are you able to explain to me the
> difference between:
>
> mutex_lock(&bus->mdio_lock);
> and
> mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
>
> While looking at other driver examples I noticed the latter form quite a
> few times too.

This is to do with the debug code for checking for deadlocks,
CONFIG_PROVE_LOCKING. When that feature is enables, each lock/unlock
of a mutex is tracked, and a list is made of what other locks are also
taken, and the order. The code can find deadlocks where one thread
takes A then B, while another thread takes B and then A. It can also
detect when a thread takes lock A and then tries to take lock A again.

Rather than track each individual mutex, it uses classes of mutex. So
bus->mdio_lock is a class of mutex. The code simply tracks that a
bus->mdio_lock has been taken, not a specific bus->mdio_lock. That is
generally sufficient, but not always. The mv88e6xxx switch is like
many switches, accessed over MDIO. But the mv88e6xxx switch offers an
MDIO bus, and there is an MDIO bus driver inside the mv88e6xxx
driver. So you have nested MDIO calls. So this debug code seems the
same class of mutex being taken twice, and thinks it is a
deadlock. You can tell it that nested MDIO calls are actually O.K, it
won't deadlock.

Andrew

2022-02-17 15:23:22

by Vladimir Oltean

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

On Thu, Feb 17, 2022 at 07:41:32AM +0000, Alvin Šipraga wrote:
> Vladimir Oltean <[email protected]> writes:
>
> > On Wed, Feb 16, 2022 at 05:05:00PM +0100, Alvin Šipraga wrote:
> >> From: Alvin Šipraga <[email protected]>
> >>
> >> Realtek switches in the rtl8365mb family can access the PHY registers of
> >> the internal PHYs via the switch registers. This method is called
> >> indirect access. At a high level, the indirect PHY register access
> >> method involves reading and writing some special switch registers in a
> >> particular sequence. This works for both SMI and MDIO connected
> >> switches.
> >>
> >> Currently the rtl8365mb driver does not take any care to serialize the
> >> aforementioned access to the switch registers. In particular, it is
> >> permitted for other driver code to access other switch registers while
> >> the indirect PHY register access is ongoing. Locking is only done at the
> >> regmap level. This, however, is a bug: concurrent register access, even
> >> to unrelated switch registers, risks corrupting the PHY register value
> >> read back via the indirect access method described above.
> >>
> >> Arınç reported that the switch sometimes returns nonsense data when
> >> reading the PHY registers. In particular, a value of 0 causes the
> >> kernel's PHY subsystem to think that the link is down, but since most
> >> reads return correct data, the link then flip-flops between up and down
> >> over a period of time.
> >>
> >> The aforementioned bug can be readily observed by:
> >>
> >> 1. Enabling ftrace events for regmap and mdio
> >> 2. Polling BSMR PHY register for a connected port;
> >> it should always read the same (e.g. 0x79ed)
> >> 3. Wait for step 2 to give a different value
> >>
> >> Example command for step 2:
> >>
> >> while true; do phytool read swp2/2/0x01; done
> >>
> >> On my i.MX8MM, the above steps will yield a bogus value for the BSMR PHY
> >> register within a matter of seconds. The interleaved register access it
> >> then evident in the trace log:
> >>
> >> kworker/3:4-70 [003] ....... 1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
> >> phytool-16816 [002] ....... 1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
> >> kworker/3:4-70 [003] ....... 1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
> >> phytool-16816 [002] ....... 1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
> >> kworker/3:4-70 [003] ....... 1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
> >> phytool-16816 [002] ....... 1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
> >> kworker/3:4-70 [003] ....... 1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
> >> phytool-16816 [002] ....... 1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
> >> kworker/3:4-70 [003] ....... 1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
> >> phytool-16816 [002] ....... 1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
> >> kworker/3:4-70 [003] ....... 1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
> >> phytool-16816 [002] ....... 1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
> >> phytool-16816 [002] ....... 1927.142641: mdio_access: SMI-0 read phy:0x02 reg:0x01 val:0x0000 <- ?!
> >> kworker/3:4-70 [003] ....... 1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
> >> kworker/3:4-70 [003] ....... 1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
> >> kworker/3:4-70 [003] ....... 1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
> >> kworker/3:4-70 [003] ....... 1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
> >> kworker/3:4-70 [003] ....... 1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
> >> kworker/3:4-70 [003] ....... 1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
> >>
> >> The kworker here is polling MIB counters for stats, as evidenced by the
> >> register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
> >> polling is performed every 3 seconds, but is just one example of such
> >> unsynchronized access.
> >>
> >> Further investigation reveals the underlying problem: if we read from an
> >> arbitrary register A and this read coincides with the indirect access
> >> method in rtl8365mb_phy_ocp_read, then the final read from
> >> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
> >> register A. The value read back can be readily poisoned by repeatedly
> >> reading back the value of another register A via debugfs in a busy loop
> >> via the dd utility or similar.
> >>
> >> This issue appears to be unique to the indirect PHY register access
> >> pattern. In particular, it does not seem to impact similar sequential
> >> register operations such MIB counter access.
> >>
> >> To fix this problem, one must guard against exactly the scenario seen in
> >> the above trace. In particular, other parts of the driver using the
> >> regmap API must not be permitted to access the switch registers until
> >> the PHY register access is complete. Fix this by using the newly
> >> introduced "nolock" regmap in all PHY-related functions, and by aquiring
> >> the regmap mutex at the top level of the PHY register access callbacks.
> >> Although no issue has been observed with PHY register _writes_, this
> >> change also serializes the indirect access method there. This is done
> >> purely as a matter of convenience.
> >>
> >> Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
> >> Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
> >> Reported-by: Arınç ÜNAL <[email protected]>
> >> Reported-by: Luiz Angelo Daros de Luca <[email protected]>
> >> Signed-off-by: Alvin Šipraga <[email protected]>
> >> ---
> >
> > This implementation where the indirect PHY access blocks out every other
> > register read and write is only justified if you can prove that you can
> > stuff just about any unrelated register read or write before
> > RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
> > will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.
>
> I (at least treied to) state that clearly here:
>
> >> Further investigation reveals the underlying problem: if we read from an
> >> arbitrary register A and this read coincides with the indirect access
> >> method in rtl8365mb_phy_ocp_read, then the final read from
> >> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
> >> register A. The value read back can be readily poisoned by repeatedly
> >> reading back the value of another register A via debugfs in a busy loop
> >> via the dd utility or similar.
>
> That is, I used regmap debugfs to spam reads of switch registers like,
> for example, this one:
>
> #define RTL8365MB_CFG0_MAX_LEN_REG 0x088C
>
> ... which controls the MTU of the switch. This is something we set up
> just once to be 0x600 and then it is never touched again. Now in the
> above example, let A = 0x088C. Spamming the read of A phytool command
> described above, I would expect to read a value 0x79c9 out of my BSMR
> PHY register with phytool. But in cases where the read of switch
> register A coincides with the indirect access procedure, I end up
> reading back 0x600 from the PHY register. This is specifically because
> the read of A (=0x600) then poisons the value in
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG (should be 0x79c9, but is
> 0x600).

Yes, well, that was a bit handwavy, you didn't mention any other
specific register, you just stated a rule which appeared to be inferred
from little evidence.

> > rtl8365mb_mib_counter_read() doesn't seem like a particularly good
> > example to prove this, since it appears to be an indirect access
> > procedure as well. Single register reads or writes would be ideal, like
> > RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
> > Ideally you wouldn't even have a DSA or MDIO or PHY driver running.
>
> I hope it is clear from my above explanation that I did show this, if
> you agree that RTL8365MB_CFG0_MAX_LEN_REG is just as arbitrary as
> RTL8365MB_CPU_CTRL_REG.
>
> What I meant to say here:
>
> >> This issue appears to be unique to the indirect PHY register access
> >> pattern. In particular, it does not seem to impact similar sequential
> >> register operations such MIB counter access.
>
> ... about MIB counter access (which is also indirect as you point out),
> is that it does _not_ suffer from the above problem. The way I checked
> this was with ethtool -S, while again spamming regmap_read of an
> unrelated switch register like CPU_CTRL or CFG0_MAX_LEN. In this case
> the counter values always seem sane, and I can't detect the poisoned
> value getting read back (like 0x600 in the above example).
>
> > Just a simple kernel module with access to the regmap, and try to read
> > something known, like the PHY ID of one of the internal PHYs, via an
> > open-coded function. Then add extra regmap accesses and see what
> > corrupts the indirect PHY access procedure.
>
> The switch is generally idle and I did my testing with the periodic MIB
> counter disabled, so I think what you describe is not far off from what
> I did. The only difference is that the switch was already configured and
> switching packets. I used ftrace events to verify the phenomenon.
>
> If you are still not persuaded, just write me back here, and I will go
> ahead and implement such a test module. But it seems like you
> misunderstood my initial commit message, so perhaps I just need to
> rephrase it?

If the problem you've identified is correct, then this simple test
module would yield the exact same result, yet would eliminate beyond any
doubt the timing and other circumstantial factors, and you could also
do better testing of the PHY write sequence, and MIB counter reads.
And if simply inserting a stray register access in the middle of the PHY
read procedure doesn't produce the same result, this would be new
information. It shouldn't even be too hard to do.

> > Are Realtek aware of this and do they confirm the issue? Sounds like
> > erratum material to me, and a pretty severe one, at that. Alternatively,
> > we may simply not be understanding the hardware architecture, like for
> > example the fact that MIB indirect access and PHY indirect access may
> > share some common bus and must be sequential w.r.t. each other.
>
> The thing is that Realtek's vendor driver takes a common lock around
> every public API call. One of those APIs is "read phy register" and
> there it will take a lock around the whole procedure. At the same time
> it will also take the same lock for something like "read switch MTU" or
> "read CPU tag position", etc. So I don't believe their driver will
> suffer from this issue.
>
> In any case it was on my list to write them a mail about this, so let's
> see what they say.
>
> Kind regards,
> Alvin

I have little to no problem with the workaround you've implemented, it's
just that extraordinary claims require extraordinary proof. Having a
standalone kernel module that can deterministically and not statistically
reproduce the bug would go a long way.

2022-02-17 18:11:39

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

Andrew Lunn <[email protected]> writes:

>> Thank you Andrew for the clear explanation.
>>
>> Somewhat unrelated to this series, but are you able to explain to me the
>> difference between:
>>
>> mutex_lock(&bus->mdio_lock);
>> and
>> mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
>>
>> While looking at other driver examples I noticed the latter form quite a
>> few times too.
>
> This is to do with the debug code for checking for deadlocks,
> CONFIG_PROVE_LOCKING. When that feature is enables, each lock/unlock
> of a mutex is tracked, and a list is made of what other locks are also
> taken, and the order. The code can find deadlocks where one thread
> takes A then B, while another thread takes B and then A. It can also
> detect when a thread takes lock A and then tries to take lock A again.
>
> Rather than track each individual mutex, it uses classes of mutex. So
> bus->mdio_lock is a class of mutex. The code simply tracks that a
> bus->mdio_lock has been taken, not a specific bus->mdio_lock. That is
> generally sufficient, but not always. The mv88e6xxx switch is like
> many switches, accessed over MDIO. But the mv88e6xxx switch offers an
> MDIO bus, and there is an MDIO bus driver inside the mv88e6xxx
> driver. So you have nested MDIO calls. So this debug code seems the
> same class of mutex being taken twice, and thinks it is a
> deadlock. You can tell it that nested MDIO calls are actually O.K, it
> won't deadlock.

Thanks for the explanation, the missing piece of the puzzle was the fact
that some switch drivers expose an additional MDIO bus. I can understand
the CONFIG_PROVE_LOCKING rationale.

If you have the patience to answer a few more questions:

1. You mentioned in an earlier mail that the mdio_lock is used mostly by
PHY drivers to synchronize their access to the MDIO bus, for a single
read or write. You also mentioned that for switches which have a more
involved access pattern (for instance to access switch management
registers), a higher lock is required. In realtek-mdio this is the case:
we do a couple of reads and writes over the MDIO bus to access the
switch registers. Moreover, the mdio_lock is held for the duration of
these MDIO bus reads/writes. Do you mean to say that one should rather
take a higher-level lock and only lock/unlock the mdio_lock on a
per-read or per-write basis? Put another way, should this:

static int realtek_mdio_read(void *ctx, u32 reg, u32 *val)
{
/* ... */

mutex_lock(&bus->mdio_lock);

bus->write(bus, priv->mdio_addr, ...);
bus->write(bus, priv->mdio_addr, ...);
bus->write(bus, priv->mdio_addr, ...);
bus->read(bus, priv->mdio_addr, ...);

/* ... */

mutex_unlock(&bus->mdio_lock);

return ret;
}

rather look like this?:

static int realtek_mdio_read(void *ctx, u32 reg, u32 *val)
{
/* ... */

mutex_lock(&my_realtek_driver_lock); /* synchronize concurrent realtek_mdio_{read,write} */

mdiobus_write(bus, priv->mdio_addr, ...); /* mdio_lock locked/unlocked here */
mdiobus_write(bus, priv->mdio_addr, ...); /* ditto */
mdiobus_write(bus, priv->mdio_addr, ...); /* ditto */
mdiobus_read(bus, priv->mdio_addr, ...); /* ditto */

/* ... */

mutex_unlock(&my_realtek_driver_lock);

return ret;
}


2. Is the nested locking only relevant for DSA switches which offer
another MDIO bus? Or should all switch drivers do this, on the basis
that, feasibly, one could connect my Realtek switch to the MDIO bus of a
mv88e6xxx switch? In that case, and assuming the latter form of
raeltek_mdio_read above, should one use the mdiobus_{read,write}_nested
functions instead?

Kind regards,
Alvin

2022-02-17 23:43:12

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Vladimir Oltean <[email protected]> writes:

> On Thu, Feb 17, 2022 at 07:41:32AM +0000, Alvin Šipraga wrote:
>> Vladimir Oltean <[email protected]> writes:
>>
>> > On Wed, Feb 16, 2022 at 05:05:00PM +0100, Alvin Šipraga wrote:
>> >> From: Alvin Šipraga <[email protected]>
>> >>
>> >> Realtek switches in the rtl8365mb family can access the PHY registers of
>> >> the internal PHYs via the switch registers. This method is called
>> >> indirect access. At a high level, the indirect PHY register access
>> >> method involves reading and writing some special switch registers in a
>> >> particular sequence. This works for both SMI and MDIO connected
>> >> switches.
>> >>
>> >> Currently the rtl8365mb driver does not take any care to serialize the
>> >> aforementioned access to the switch registers. In particular, it is
>> >> permitted for other driver code to access other switch registers while
>> >> the indirect PHY register access is ongoing. Locking is only done at the
>> >> regmap level. This, however, is a bug: concurrent register access, even
>> >> to unrelated switch registers, risks corrupting the PHY register value
>> >> read back via the indirect access method described above.
>> >>
>> >> Arınç reported that the switch sometimes returns nonsense data when
>> >> reading the PHY registers. In particular, a value of 0 causes the
>> >> kernel's PHY subsystem to think that the link is down, but since most
>> >> reads return correct data, the link then flip-flops between up and down
>> >> over a period of time.
>> >>
>> >> The aforementioned bug can be readily observed by:
>> >>
>> >> 1. Enabling ftrace events for regmap and mdio
>> >> 2. Polling BSMR PHY register for a connected port;
>> >> it should always read the same (e.g. 0x79ed)
>> >> 3. Wait for step 2 to give a different value
>> >>
>> >> Example command for step 2:
>> >>
>> >> while true; do phytool read swp2/2/0x01; done
>> >>
>> >> On my i.MX8MM, the above steps will yield a bogus value for the BSMR PHY
>> >> register within a matter of seconds. The interleaved register access it
>> >> then evident in the trace log:
>> >>
>> >> kworker/3:4-70 [003] ....... 1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
>> >> phytool-16816 [002] ....... 1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
>> >> kworker/3:4-70 [003] ....... 1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
>> >> phytool-16816 [002] ....... 1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
>> >> kworker/3:4-70 [003] ....... 1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
>> >> phytool-16816 [002] ....... 1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
>> >> kworker/3:4-70 [003] ....... 1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
>> >> phytool-16816 [002] ....... 1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
>> >> kworker/3:4-70 [003] ....... 1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
>> >> phytool-16816 [002] ....... 1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
>> >> kworker/3:4-70 [003] ....... 1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
>> >> phytool-16816 [002] ....... 1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
>> >> phytool-16816 [002] ....... 1927.142641: mdio_access: SMI-0 read phy:0x02 reg:0x01 val:0x0000 <- ?!
>> >> kworker/3:4-70 [003] ....... 1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
>> >> kworker/3:4-70 [003] ....... 1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
>> >> kworker/3:4-70 [003] ....... 1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
>> >> kworker/3:4-70 [003] ....... 1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
>> >> kworker/3:4-70 [003] ....... 1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
>> >> kworker/3:4-70 [003] ....... 1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
>> >>
>> >> The kworker here is polling MIB counters for stats, as evidenced by the
>> >> register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
>> >> polling is performed every 3 seconds, but is just one example of such
>> >> unsynchronized access.
>> >>
>> >> Further investigation reveals the underlying problem: if we read from an
>> >> arbitrary register A and this read coincides with the indirect access
>> >> method in rtl8365mb_phy_ocp_read, then the final read from
>> >> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
>> >> register A. The value read back can be readily poisoned by repeatedly
>> >> reading back the value of another register A via debugfs in a busy loop
>> >> via the dd utility or similar.
>> >>
>> >> This issue appears to be unique to the indirect PHY register access
>> >> pattern. In particular, it does not seem to impact similar sequential
>> >> register operations such MIB counter access.
>> >>
>> >> To fix this problem, one must guard against exactly the scenario seen in
>> >> the above trace. In particular, other parts of the driver using the
>> >> regmap API must not be permitted to access the switch registers until
>> >> the PHY register access is complete. Fix this by using the newly
>> >> introduced "nolock" regmap in all PHY-related functions, and by aquiring
>> >> the regmap mutex at the top level of the PHY register access callbacks.
>> >> Although no issue has been observed with PHY register _writes_, this
>> >> change also serializes the indirect access method there. This is done
>> >> purely as a matter of convenience.
>> >>
>> >> Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
>> >> Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
>> >> Reported-by: Arınç ÜNAL <[email protected]>
>> >> Reported-by: Luiz Angelo Daros de Luca <[email protected]>
>> >> Signed-off-by: Alvin Šipraga <[email protected]>
>> >> ---
>> >
>> > This implementation where the indirect PHY access blocks out every other
>> > register read and write is only justified if you can prove that you can
>> > stuff just about any unrelated register read or write before
>> > RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
>> > will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.
>>
>> I (at least treied to) state that clearly here:
>>
>> >> Further investigation reveals the underlying problem: if we read from an
>> >> arbitrary register A and this read coincides with the indirect access
>> >> method in rtl8365mb_phy_ocp_read, then the final read from
>> >> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
>> >> register A. The value read back can be readily poisoned by repeatedly
>> >> reading back the value of another register A via debugfs in a busy loop
>> >> via the dd utility or similar.
>>
>> That is, I used regmap debugfs to spam reads of switch registers like,
>> for example, this one:
>>
>> #define RTL8365MB_CFG0_MAX_LEN_REG 0x088C
>>
>> ... which controls the MTU of the switch. This is something we set up
>> just once to be 0x600 and then it is never touched again. Now in the
>> above example, let A = 0x088C. Spamming the read of A phytool command
>> described above, I would expect to read a value 0x79c9 out of my BSMR
>> PHY register with phytool. But in cases where the read of switch
>> register A coincides with the indirect access procedure, I end up
>> reading back 0x600 from the PHY register. This is specifically because
>> the read of A (=0x600) then poisons the value in
>> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG (should be 0x79c9, but is
>> 0x600).
>
> Yes, well, that was a bit handwavy, you didn't mention any other
> specific register, you just stated a rule which appeared to be inferred
> from little evidence.
>
>> > rtl8365mb_mib_counter_read() doesn't seem like a particularly good
>> > example to prove this, since it appears to be an indirect access
>> > procedure as well. Single register reads or writes would be ideal, like
>> > RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
>> > Ideally you wouldn't even have a DSA or MDIO or PHY driver running.
>>
>> I hope it is clear from my above explanation that I did show this, if
>> you agree that RTL8365MB_CFG0_MAX_LEN_REG is just as arbitrary as
>> RTL8365MB_CPU_CTRL_REG.
>>
>> What I meant to say here:
>>
>> >> This issue appears to be unique to the indirect PHY register access
>> >> pattern. In particular, it does not seem to impact similar sequential
>> >> register operations such MIB counter access.
>>
>> ... about MIB counter access (which is also indirect as you point out),
>> is that it does _not_ suffer from the above problem. The way I checked
>> this was with ethtool -S, while again spamming regmap_read of an
>> unrelated switch register like CPU_CTRL or CFG0_MAX_LEN. In this case
>> the counter values always seem sane, and I can't detect the poisoned
>> value getting read back (like 0x600 in the above example).
>>
>> > Just a simple kernel module with access to the regmap, and try to read
>> > something known, like the PHY ID of one of the internal PHYs, via an
>> > open-coded function. Then add extra regmap accesses and see what
>> > corrupts the indirect PHY access procedure.
>>
>> The switch is generally idle and I did my testing with the periodic MIB
>> counter disabled, so I think what you describe is not far off from what
>> I did. The only difference is that the switch was already configured and
>> switching packets. I used ftrace events to verify the phenomenon.
>>
>> If you are still not persuaded, just write me back here, and I will go
>> ahead and implement such a test module. But it seems like you
>> misunderstood my initial commit message, so perhaps I just need to
>> rephrase it?
>
> If the problem you've identified is correct, then this simple test
> module would yield the exact same result, yet would eliminate beyond any
> doubt the timing and other circumstantial factors, and you could also
> do better testing of the PHY write sequence, and MIB counter reads.
> And if simply inserting a stray register access in the middle of the PHY
> read procedure doesn't produce the same result, this would be new
> information. It shouldn't even be too hard to do.
>
<snip>
>
> I have little to no problem with the workaround you've implemented, it's
> just that extraordinary claims require extraordinary proof. Having a
> standalone kernel module that can deterministically and not statistically
> reproduce the bug would go a long way.

Thanks Vladimir, I very much appreciate your scrutiny here. I'll make
the test module to verify the claims I have made. In the mean time I
asked Realtek if they have any comment.

Kind regards,
Alvin

2022-02-18 00:11:02

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Vladimir Oltean <[email protected]> writes:

> On Wed, Feb 16, 2022 at 05:05:00PM +0100, Alvin Šipraga wrote:
>> From: Alvin Šipraga <[email protected]>
>>
>> Realtek switches in the rtl8365mb family can access the PHY registers of
>> the internal PHYs via the switch registers. This method is called
>> indirect access. At a high level, the indirect PHY register access
>> method involves reading and writing some special switch registers in a
>> particular sequence. This works for both SMI and MDIO connected
>> switches.
>>
>> Currently the rtl8365mb driver does not take any care to serialize the
>> aforementioned access to the switch registers. In particular, it is
>> permitted for other driver code to access other switch registers while
>> the indirect PHY register access is ongoing. Locking is only done at the
>> regmap level. This, however, is a bug: concurrent register access, even
>> to unrelated switch registers, risks corrupting the PHY register value
>> read back via the indirect access method described above.
>>
>> Arınç reported that the switch sometimes returns nonsense data when
>> reading the PHY registers. In particular, a value of 0 causes the
>> kernel's PHY subsystem to think that the link is down, but since most
>> reads return correct data, the link then flip-flops between up and down
>> over a period of time.
>>
>> The aforementioned bug can be readily observed by:
>>
>> 1. Enabling ftrace events for regmap and mdio
>> 2. Polling BSMR PHY register for a connected port;
>> it should always read the same (e.g. 0x79ed)
>> 3. Wait for step 2 to give a different value
>>
>> Example command for step 2:
>>
>> while true; do phytool read swp2/2/0x01; done
>>
>> On my i.MX8MM, the above steps will yield a bogus value for the BSMR PHY
>> register within a matter of seconds. The interleaved register access it
>> then evident in the trace log:
>>
>> kworker/3:4-70 [003] ....... 1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
>> phytool-16816 [002] ....... 1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
>> kworker/3:4-70 [003] ....... 1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
>> phytool-16816 [002] ....... 1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
>> kworker/3:4-70 [003] ....... 1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
>> phytool-16816 [002] ....... 1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
>> kworker/3:4-70 [003] ....... 1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
>> phytool-16816 [002] ....... 1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
>> kworker/3:4-70 [003] ....... 1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
>> phytool-16816 [002] ....... 1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
>> kworker/3:4-70 [003] ....... 1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
>> phytool-16816 [002] ....... 1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
>> phytool-16816 [002] ....... 1927.142641: mdio_access: SMI-0 read phy:0x02 reg:0x01 val:0x0000 <- ?!
>> kworker/3:4-70 [003] ....... 1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
>> kworker/3:4-70 [003] ....... 1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
>> kworker/3:4-70 [003] ....... 1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
>> kworker/3:4-70 [003] ....... 1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
>> kworker/3:4-70 [003] ....... 1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
>> kworker/3:4-70 [003] ....... 1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
>>
>> The kworker here is polling MIB counters for stats, as evidenced by the
>> register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
>> polling is performed every 3 seconds, but is just one example of such
>> unsynchronized access.
>>
>> Further investigation reveals the underlying problem: if we read from an
>> arbitrary register A and this read coincides with the indirect access
>> method in rtl8365mb_phy_ocp_read, then the final read from
>> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
>> register A. The value read back can be readily poisoned by repeatedly
>> reading back the value of another register A via debugfs in a busy loop
>> via the dd utility or similar.
>>
>> This issue appears to be unique to the indirect PHY register access
>> pattern. In particular, it does not seem to impact similar sequential
>> register operations such MIB counter access.
>>
>> To fix this problem, one must guard against exactly the scenario seen in
>> the above trace. In particular, other parts of the driver using the
>> regmap API must not be permitted to access the switch registers until
>> the PHY register access is complete. Fix this by using the newly
>> introduced "nolock" regmap in all PHY-related functions, and by aquiring
>> the regmap mutex at the top level of the PHY register access callbacks.
>> Although no issue has been observed with PHY register _writes_, this
>> change also serializes the indirect access method there. This is done
>> purely as a matter of convenience.
>>
>> Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
>> Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
>> Reported-by: Arınç ÜNAL <[email protected]>
>> Reported-by: Luiz Angelo Daros de Luca <[email protected]>
>> Signed-off-by: Alvin Šipraga <[email protected]>
>> ---
>
> This implementation where the indirect PHY access blocks out every other
> register read and write is only justified if you can prove that you can
> stuff just about any unrelated register read or write before
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG, and this, in and of itself,
> will poison what gets read back from RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG.

I (at least treied to) state that clearly here:

>> Further investigation reveals the underlying problem: if we read from an
>> arbitrary register A and this read coincides with the indirect access
>> method in rtl8365mb_phy_ocp_read, then the final read from
>> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
>> register A. The value read back can be readily poisoned by repeatedly
>> reading back the value of another register A via debugfs in a busy loop
>> via the dd utility or similar.

That is, I used regmap debugfs to spam reads of switch registers like,
for example, this one:

#define RTL8365MB_CFG0_MAX_LEN_REG 0x088C

... which controls the MTU of the switch. This is something we set up
just once to be 0x600 and then it is never touched again. Now in the
above example, let A = 0x088C. Spamming the read of A phytool command
described above, I would expect to read a value 0x79c9 out of my BSMR
PHY register with phytool. But in cases where the read of switch
register A coincides with the indirect access procedure, I end up
reading back 0x600 from the PHY register. This is specifically because
the read of A (=0x600) then poisons the value in
RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG (should be 0x79c9, but is
0x600).

>
> rtl8365mb_mib_counter_read() doesn't seem like a particularly good
> example to prove this, since it appears to be an indirect access
> procedure as well. Single register reads or writes would be ideal, like
> RTL8365MB_CPU_CTRL_REG, artificially inserted into strategic places.
> Ideally you wouldn't even have a DSA or MDIO or PHY driver running.

I hope it is clear from my above explanation that I did show this, if
you agree that RTL8365MB_CFG0_MAX_LEN_REG is just as arbitrary as
RTL8365MB_CPU_CTRL_REG.

What I meant to say here:

>> This issue appears to be unique to the indirect PHY register access
>> pattern. In particular, it does not seem to impact similar sequential
>> register operations such MIB counter access.

... about MIB counter access (which is also indirect as you point out),
is that it does _not_ suffer from the above problem. The way I checked
this was with ethtool -S, while again spamming regmap_read of an
unrelated switch register like CPU_CTRL or CFG0_MAX_LEN. In this case
the counter values always seem sane, and I can't detect the poisoned
value getting read back (like 0x600 in the above example).

> Just a simple kernel module with access to the regmap, and try to read
> something known, like the PHY ID of one of the internal PHYs, via an
> open-coded function. Then add extra regmap accesses and see what
> corrupts the indirect PHY access procedure.

The switch is generally idle and I did my testing with the periodic MIB
counter disabled, so I think what you describe is not far off from what
I did. The only difference is that the switch was already configured and
switching packets. I used ftrace events to verify the phenomenon.

If you are still not persuaded, just write me back here, and I will go
ahead and implement such a test module. But it seems like you
misunderstood my initial commit message, so perhaps I just need to
rephrase it?

>
> Are Realtek aware of this and do they confirm the issue? Sounds like
> erratum material to me, and a pretty severe one, at that. Alternatively,
> we may simply not be understanding the hardware architecture, like for
> example the fact that MIB indirect access and PHY indirect access may
> share some common bus and must be sequential w.r.t. each other.

The thing is that Realtek's vendor driver takes a common lock around
every public API call. One of those APIs is "read phy register" and
there it will take a lock around the whole procedure. At the same time
it will also take the same lock for something like "read switch MTU" or
"read CPU tag position", etc. So I don't believe their driver will
suffer from this issue.

In any case it was on my list to write them a mail about this, so let's
see what they say.

Kind regards,
Alvin

2022-02-18 00:16:35

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

> If you have the patience to answer a few more questions:
>
> 1. You mentioned in an earlier mail that the mdio_lock is used mostly by
> PHY drivers to synchronize their access to the MDIO bus, for a single
> read or write. You also mentioned that for switches which have a more
> involved access pattern (for instance to access switch management
> registers), a higher lock is required. In realtek-mdio this is the case:
> we do a couple of reads and writes over the MDIO bus to access the
> switch registers. Moreover, the mdio_lock is held for the duration of
> these MDIO bus reads/writes. Do you mean to say that one should rather
> take a higher-level lock and only lock/unlock the mdio_lock on a
> per-read or per-write basis? Put another way, should this:
>
> static int realtek_mdio_read(void *ctx, u32 reg, u32 *val)
> {
> /* ... */
>
> mutex_lock(&bus->mdio_lock);
>
> bus->write(bus, priv->mdio_addr, ...);

It would be better to use __mdiobus_write()

> bus->write(bus, priv->mdio_addr, ...);
> bus->write(bus, priv->mdio_addr, ...);
> bus->read(bus, priv->mdio_addr, ...);

__mdiobus_read()

> /* ... */
>
> mutex_unlock(&bus->mdio_lock);
>
> return ret;
> }

You can do this.


> rather look like this?:
>
> static int realtek_mdio_read(void *ctx, u32 reg, u32 *val)
> {
> /* ... */
>
> mutex_lock(&my_realtek_driver_lock); /* synchronize concurrent realtek_mdio_{read,write} */
>
> mdiobus_write(bus, priv->mdio_addr, ...); /* mdio_lock locked/unlocked here */
> mdiobus_write(bus, priv->mdio_addr, ...); /* ditto */
> mdiobus_write(bus, priv->mdio_addr, ...); /* ditto */
> mdiobus_read(bus, priv->mdio_addr, ...); /* ditto */
>
> /* ... */
>
> mutex_unlock(&my_realtek_driver_lock);
>
> return ret;
> }

This would also work. The advantage of this is when you have multiple
switches on one MDIO bus, you can allow parallel operations on those
switches. Also, if there are PHYs on the MDIO bus as well as the
switch, the PHYs can be accessed as well. If you are only doing 3
writes and read, it probably does not matter. If you are going to do a
lot of accesses, maybe read all the MIB values, allowing access to the
PHYs at the same time would be nice.

> 2. Is the nested locking only relevant for DSA switches which offer
> another MDIO bus? Or should all switch drivers do this, on the basis
> that, feasibly, one could connect my Realtek switch to the MDIO bus of a
> mv88e6xxx switch? In that case, and assuming the latter form of
> raeltek_mdio_read above, should one use the mdiobus_{read,write}_nested
> functions instead?

I would suggest you start with plain mdiobus_{read,write}. Using the
_nested could potentially hide a deadlock. If somebody does build
hardware with this sort of chaining, we can change to the _nested
calls.

Andrew

Subject: Re: [PATCH net-next 0/2] net: dsa: realtek: fix PHY register read corruption

> > I still feel like we are trying to go around a regmap limitation
> > instead of fixing it there. If we control regmap lock (we can define a
> > custom lock/unlock) and create new regmap_{read,write}_nolock
> > variants, we'll just need to lock the regmap, do whatever you need,
> > and unlock it.
>
> Can you show me what those regmap_{read,write}_nolock variants would
> look like in your example? And what about the other regmap_ APIs we use,
> like regmap_read_poll_timeout, regmap_update_bits, etc. - do you propose
> to reimplement all of these?

The option of having two regmaps is a nice way to have "_nolock"
variants for free. It is much cleaner than any solutions I imagined!
Ayway, I don't believe the regmap API expects to have an evil
non-locked clone. It looks like it is being abused.

What regmap API misses is a way to create a "transaction". Mdio, for
example, expects the user to lock the bus before doing a series of
accesses while regmap api assumes a single atomic access is enough.
However, Realtek indirect register access shows that it is not enough.
We could reimplement a mutex for every case where two calls might
share the same register (or indirectly affect others like we saw with
Realtek) but I believe a shared solution would be better, even if it
costs a couple more wrap functions.

It would be even nicer if we have a regmap "manual lock" mode that
will expose the lock/unlock functions but it will never call them by
itself. It would work if it could check if the caller is actually the
same thread/context that locked it. However I doubt there is a clean
solution in a kernel code that can check if the lock was acquired by
the same context that is calling the read.


> > BTW, I believe that, for realtek-mdio, a regmap custom lock mechanism
> > could simply use mdio lock while realtek-smi already has priv->lock.
>
> Hmm OK. Actually I'm a bit confused about the mdio_lock: can you explain
> what it's guarding against, for someone unfamiliar with MDIO? Currently
> realtek-mdio's regmap has an additional lock around it (disable_locking
> is 0), so with these patches applied the number of locks remains the
> same.

Today we already have to redundants locks (mdio and regmap). Your
patch is just replacing the regmap lock.

regmap_read is something like this:

regmap_read
lock regmap
realtek_mdio_read()
lock mdio
...
unlock mdio
unlock regmap

If you are implementing a custom lock, simply use mdio lock directly.

And the map_nolock you created does not mean "access without locks"
but "you must lock it yourself before using anything here". If that
lock is actually mdio_lock, it would be ok to remove the lock inside
realtek_mdio_{read,write}. You just need a reference to those
lock/unlock functions in realtek_priv.

> priv->lock is a spinlock which is inappropriate here. I'm not really
> sure what the point of it is, besides to handle unlocked calls to the
> _noack function. It might be removable altogether but I would prefer not
> to touch it for this series.

If spinlock is inappropriate, it can be easily converted to a mutex.
Everything else from realtek-mdio might apply.

> Kind regards,
> Alvin

2022-02-21 15:00:04

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Hi again Vladimir,

Alvin Šipraga <[email protected]> writes:

> Vladimir Oltean <[email protected]> writes:
>
>> If the problem you've identified is correct, then this simple test
>> module would yield the exact same result, yet would eliminate beyond any
>> doubt the timing and other circumstantial factors, and you could also
>> do better testing of the PHY write sequence, and MIB counter reads.
>> And if simply inserting a stray register access in the middle of the PHY
>> read procedure doesn't produce the same result, this would be new
>> information. It shouldn't even be too hard to do.
>>
> <snip>
>>
>> I have little to no problem with the workaround you've implemented, it's
>> just that extraordinary claims require extraordinary proof. Having a
>> standalone kernel module that can deterministically and not statistically
>> reproduce the bug would go a long way.
>
> Thanks Vladimir, I very much appreciate your scrutiny here. I'll make
> the test module to verify the claims I have made. In the mean time I
> asked Realtek if they have any comment.

So I made a test module which, in summary, checks the following:

1. for PHY reads, at what point does inserting a stray register access
(either read or write) cause the PHY read to fail?
2. for PHY writes, can stray register access also cause failure?
2. for MIB reads, can stray register access also cause failure?

For (1) I instrumented the PHY indirect access functions in the 6
possible places where spurious register access could occur. Of those 6
locations for spurious register access, 4 have no effect: you can put a
read or write to an unrelated register there and the PHY read will
always succeed. I tested this with spurious access to nearly every
available register on the switch.

However, for two locations of spurious register access, the PHY read
_always_ fails. The locations are marked /* XXX */ below:

/* Simplified for brevity */
static int rtl8365mb_phy_ocp_read(struct realtek_priv *priv, int phy,
u32 ocp_addr, u16 *data)
{
rtl8365mb_phy_poll_busy(priv);

rtl8365mb_phy_ocp_prepare(priv, phy, ocp_addr);

/* Execute read operation */
regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_CTRL_REG, val);

/* XXX */

rtl8365mb_phy_poll_busy(priv);

/* XXX */

/* Get PHY register data */
regmap_read(priv->map, RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG,
&val);

*data = val & 0xFFFF;

return 0;
}

In the case of a spurious read, the result of that read then poisons the
ongoing PHY read, as suggested before. Again I verified that this is
always the case, for each available register on the switch. Spurious
writes also cause failure, and in the same locations too. I did not
investigate whether the value written is then read back as part of the
PHY read.

For (2) I did something similar to (1), but the difference here is that
I could never get PHY writes to fail. Admittedly not all bits of the PHY
registers tend to be writable, but for those bits that were writable, I
would always then read back what I had written.

For (3) I did something similar to (1), and as claimed previously, this
never resulted in a read failure. Here I had to use the MIB counters of
a disconnected port so that I could assume the values were always 0.

I have attached the test module (and header file generated from an
enormous header file from the Realtek driver sources, so that I could
iterate over every possible register). It is pretty gruesome reading but
gives me confidence in my earlier claims. The only refinements to those
claims are:

- where _exactly_ a spurious register access will cause failure: see the
/* XXX */ in the code snippet upstairs;
- PHY writes seem not to be affected at all.

Finally, I reached out to Realtek, and they confirmed pretty much the
same as above. However, they claim it is not a hardware bug, but merely
a property of the hardware design. Here I paraphrase what was said:

1. Yes, spurious register access during PHY indirect access will cause
the indirect access to fail. This is a result of the hardware design. In
general, _if a read fails, the value read back will be the result of the
last successful read_. This confirms the "register poisoning" described
earlier.

2. MIB access is a different story - this is table lookup, not indirect
access. Table lookup is not affected by spurious register access.

3. Other possible accesses - not currently present in this driver, but
for which I have some WIP changes - include ACL (Access Control List),
L2 (FDB), and MC (MDB) access. But all of these are table access similar
to MIB access, and hence not troubled by spurious register access.

4. HOWEVER, only one table can be accessed at a time. So a lock is
needed here. Currently the only table lookup is MIB access, which is
protected by mib_lock, so we are OK for now.

5. It should be sufficient to lock during indirect PHY register access
as prescribed in my patch.

I hope that clears things up. I will be sending a v2 with a revised
description, including the statements from Realtek and the results of
the tests I ran.

Kind regards,
Alvin


Attachments:
regs.h (278.24 kB)
regs.h
rtltest.c (48.49 kB)
rtltest.c
Download all attachments

2022-02-21 19:15:05

by Vladimir Oltean

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Hi Alvin,

On Mon, Feb 21, 2022 at 02:50:24PM +0000, Alvin Šipraga wrote:
> So I made a test module which, in summary, checks the following:
>
> 1. for PHY reads, at what point does inserting a stray register access
> (either read or write) cause the PHY read to fail?
> 2. for PHY writes, can stray register access also cause failure?
> 2. for MIB reads, can stray register access also cause failure?
>
> For (1) I instrumented the PHY indirect access functions in the 6
> possible places where spurious register access could occur. Of those 6
> locations for spurious register access, 4 have no effect: you can put a
> read or write to an unrelated register there and the PHY read will
> always succeed. I tested this with spurious access to nearly every
> available register on the switch.
>
> However, for two locations of spurious register access, the PHY read
> _always_ fails. The locations are marked /* XXX */ below:
>
> /* Simplified for brevity */
> static int rtl8365mb_phy_ocp_read(struct realtek_priv *priv, int phy,
> u32 ocp_addr, u16 *data)
> {
> rtl8365mb_phy_poll_busy(priv);
>
> rtl8365mb_phy_ocp_prepare(priv, phy, ocp_addr);
>
> /* Execute read operation */
> regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_CTRL_REG, val);
>
> /* XXX */
>
> rtl8365mb_phy_poll_busy(priv);
>
> /* XXX */
>
> /* Get PHY register data */
> regmap_read(priv->map, RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG,
> &val);
>
> *data = val & 0xFFFF;
>
> return 0;
> }
>
> In the case of a spurious read, the result of that read then poisons the
> ongoing PHY read, as suggested before. Again I verified that this is
> always the case, for each available register on the switch. Spurious
> writes also cause failure, and in the same locations too. I did not
> investigate whether the value written is then read back as part of the
> PHY read.
>
> For (2) I did something similar to (1), but the difference here is that
> I could never get PHY writes to fail. Admittedly not all bits of the PHY
> registers tend to be writable, but for those bits that were writable, I
> would always then read back what I had written.
>
> For (3) I did something similar to (1), and as claimed previously, this
> never resulted in a read failure. Here I had to use the MIB counters of
> a disconnected port so that I could assume the values were always 0.
>
> I have attached the test module (and header file generated from an
> enormous header file from the Realtek driver sources, so that I could
> iterate over every possible register). It is pretty gruesome reading but
> gives me confidence in my earlier claims. The only refinements to those
> claims are:
>
> - where _exactly_ a spurious register access will cause failure: see the
> /* XXX */ in the code snippet upstairs;
> - PHY writes seem not to be affected at all.
>
> Finally, I reached out to Realtek, and they confirmed pretty much the
> same as above. However, they claim it is not a hardware bug, but merely
> a property of the hardware design. Here I paraphrase what was said:
>
> 1. Yes, spurious register access during PHY indirect access will cause
> the indirect access to fail. This is a result of the hardware design. In
> general, _if a read fails, the value read back will be the result of the
> last successful read_. This confirms the "register poisoning" described
> earlier.
>
> 2. MIB access is a different story - this is table lookup, not indirect
> access. Table lookup is not affected by spurious register access.
>
> 3. Other possible accesses - not currently present in this driver, but
> for which I have some WIP changes - include ACL (Access Control List),
> L2 (FDB), and MC (MDB) access. But all of these are table access similar
> to MIB access, and hence not troubled by spurious register access.
>
> 4. HOWEVER, only one table can be accessed at a time. So a lock is
> needed here. Currently the only table lookup is MIB access, which is
> protected by mib_lock, so we are OK for now.
>
> 5. It should be sufficient to lock during indirect PHY register access
> as prescribed in my patch.
>
> I hope that clears things up. I will be sending a v2 with a revised
> description, including the statements from Realtek and the results of
> the tests I ran.
>
> Kind regards,
> Alvin

Nice work!

This looks more comprehensive, although regarding check_phy_write(),
my understanding is that you checked cross-reads and cross-writes with
only one register: priv->read_reg is implicitly 0 during the
do_reg_work() -> check_phy_write() call sequence, so that register is
probably PORT0_CGST_HALF_CFG.

Anyway, if Realtek's description is that "if a read fails, the value
read back will be the result of the last successful read", then it's
probably not suprising that cross-reads and cross-writes don't make the
indirect PHY write fail (since there's no register read). I don't have
the background of what is the OCP, but the implication of the above
paragraph seems to be that an indirect PHY read is in essence the read
of a single register, which gets aborted when a read of any other
register except RTL8365MB_INDIRECT_ACCESS_STATUS_REG or
RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG gets initiated.

2022-02-22 00:44:20

by Alvin Šipraga

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

Vladimir Oltean <[email protected]> writes:

> Hi Alvin,
>
> On Mon, Feb 21, 2022 at 02:50:24PM +0000, Alvin Šipraga wrote:
>> So I made a test module which, in summary, checks the following:
>>
>> 1. for PHY reads, at what point does inserting a stray register access
>> (either read or write) cause the PHY read to fail?
>> 2. for PHY writes, can stray register access also cause failure?
>> 2. for MIB reads, can stray register access also cause failure?
>>
>> For (1) I instrumented the PHY indirect access functions in the 6
>> possible places where spurious register access could occur. Of those 6
>> locations for spurious register access, 4 have no effect: you can put a
>> read or write to an unrelated register there and the PHY read will
>> always succeed. I tested this with spurious access to nearly every
>> available register on the switch.
>>
>> However, for two locations of spurious register access, the PHY read
>> _always_ fails. The locations are marked /* XXX */ below:
>>
>> /* Simplified for brevity */
>> static int rtl8365mb_phy_ocp_read(struct realtek_priv *priv, int phy,
>> u32 ocp_addr, u16 *data)
>> {
>> rtl8365mb_phy_poll_busy(priv);
>>
>> rtl8365mb_phy_ocp_prepare(priv, phy, ocp_addr);
>>
>> /* Execute read operation */
>> regmap_write(priv->map, RTL8365MB_INDIRECT_ACCESS_CTRL_REG, val);
>>
>> /* XXX */
>>
>> rtl8365mb_phy_poll_busy(priv);
>>
>> /* XXX */
>>
>> /* Get PHY register data */
>> regmap_read(priv->map, RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG,
>> &val);
>>
>> *data = val & 0xFFFF;
>>
>> return 0;
>> }
>>
>> In the case of a spurious read, the result of that read then poisons the
>> ongoing PHY read, as suggested before. Again I verified that this is
>> always the case, for each available register on the switch. Spurious
>> writes also cause failure, and in the same locations too. I did not
>> investigate whether the value written is then read back as part of the
>> PHY read.
>>
>> For (2) I did something similar to (1), but the difference here is that
>> I could never get PHY writes to fail. Admittedly not all bits of the PHY
>> registers tend to be writable, but for those bits that were writable, I
>> would always then read back what I had written.
>>
>> For (3) I did something similar to (1), and as claimed previously, this
>> never resulted in a read failure. Here I had to use the MIB counters of
>> a disconnected port so that I could assume the values were always 0.
>>
>> I have attached the test module (and header file generated from an
>> enormous header file from the Realtek driver sources, so that I could
>> iterate over every possible register). It is pretty gruesome reading but
>> gives me confidence in my earlier claims. The only refinements to those
>> claims are:
>>
>> - where _exactly_ a spurious register access will cause failure: see the
>> /* XXX */ in the code snippet upstairs;
>> - PHY writes seem not to be affected at all.
>>
>> Finally, I reached out to Realtek, and they confirmed pretty much the
>> same as above. However, they claim it is not a hardware bug, but merely
>> a property of the hardware design. Here I paraphrase what was said:
>>
>> 1. Yes, spurious register access during PHY indirect access will cause
>> the indirect access to fail. This is a result of the hardware design. In
>> general, _if a read fails, the value read back will be the result of the
>> last successful read_. This confirms the "register poisoning" described
>> earlier.
>>
>> 2. MIB access is a different story - this is table lookup, not indirect
>> access. Table lookup is not affected by spurious register access.
>>
>> 3. Other possible accesses - not currently present in this driver, but
>> for which I have some WIP changes - include ACL (Access Control List),
>> L2 (FDB), and MC (MDB) access. But all of these are table access similar
>> to MIB access, and hence not troubled by spurious register access.
>>
>> 4. HOWEVER, only one table can be accessed at a time. So a lock is
>> needed here. Currently the only table lookup is MIB access, which is
>> protected by mib_lock, so we are OK for now.
>>
>> 5. It should be sufficient to lock during indirect PHY register access
>> as prescribed in my patch.
>>
>> I hope that clears things up. I will be sending a v2 with a revised
>> description, including the statements from Realtek and the results of
>> the tests I ran.
>>
>> Kind regards,
>> Alvin
>
> Nice work!
>
> This looks more comprehensive, although regarding check_phy_write(),
> my understanding is that you checked cross-reads and cross-writes with
> only one register: priv->read_reg is implicitly 0 during the
> do_reg_work() -> check_phy_write() call sequence, so that register is
> probably PORT0_CGST_HALF_CFG.

Woops, my bad - thanks for checking. I added an extra loop in
check_phy_write() now to do the checking with a cross-read of every
(good) register (and moved check_good_regs() before
check_phy_write()). The results are broadly the same, although I start
to get some changes when reaching registers like PORT0_STATUS (for
obvious reasons, since I'm poking the PHY control register). There the
logic of my test starts to break down a bit because a lot of the sanity
checks assume that the registers are non-volatile.

Still, PORT0_STATUS is the ~3000th register in the list, so if all other
cross-reads prior to that don't affect the PHY write, I am convinced
that this is not a problem for PHY writes.

For cross-writes I am always writing the same register anyway.

>
> Anyway, if Realtek's description is that "if a read fails, the value
> read back will be the result of the last successful read", then it's
> probably not suprising that cross-reads and cross-writes don't make the
> indirect PHY write fail (since there's no register read). I don't have
> the background of what is the OCP, but the implication of the above
> paragraph seems to be that an indirect PHY read is in essence the read
> of a single register, which gets aborted when a read of any other
> register except RTL8365MB_INDIRECT_ACCESS_STATUS_REG or
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG gets initiated.

I agree with what you wrote above, I think it captures the point
succinctly. (I also don't know what OCP stands for.)

Kind regards,
Alvin

Subject: Re: [PATCH net-next 2/2] net: dsa: realtek: rtl8365mb: serialize indirect PHY register access

> > The realtek "API/driver" does exactly how the driver was doing. They
> > do have a lock/unlock placeholder, but only in the equivalent
> > regmap_{read,write} functions. Indirect access does not use locks at
> > all (in fact, there is no other mention of "lock" elsewhere), even
> > being obvious that it is not thread-safe. It was just with a DSA
> > driver that we started to exercise register access for real, specially
> > without interruptions. And even in that case, we could only notice
> > this issue in multicore devices. I believe that, if they know about
> > this issue, they might not be worried because it has never affected a
> > real device. It would be very interesting to hear from Realtek but I
> > do not have the contacts.
>
> This is not true, at least with the sources I am reading. As I said in
> my reply to Vladimir, the Realtek code takes a lock around each
> top-level API call. Example:
>
> rtk_api_ret_t rtk_port_phyStatus_get(...)
> {
> rtk_api_ret_t retVal;
>
> if (NULL == RT_MAPPER->port_phyStatus_get)
> return RT_ERR_DRIVER_NOT_FOUND;
>
> RTK_API_LOCK();
> retVal = RT_MAPPER->port_phyStatus_get(port, pLinkStatus, pSpeed, pDuplex);
> RTK_API_UNLOCK();
>
> return retVal;
> }
>
> Deep down in this port_phyStatus_get() callback, the indirect PHY
> register access takes place.

For the record, in the rtl8367c driver I'm using as reference, there
is no mention of RTK_API_LOCK(). Check here same function you copied:

https://github.com/openwrt/openwrt/blob/aae7af4219e56c2787f675109d9dd1a44a5dcba4/target/linux/mediatek/files-5.10/drivers/net/phy/rtk/rtl8367c/port.c#L1003-L1040

So, this indirect reg access protection is something they added along
the way, probably when they started to use it with SMP systems.

> Kind regards,
> Alvin