2023-10-28 11:17:23

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On RTL8411b the RX unit gets confused if the PHY is powered-down.
This was reported in [0] and confirmed by Realtek. Realtek provided
a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to
program the RTL registers for recovery.

r8168_mac_ocp_write() expands to this code:

static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
{
if (rtl_ocp_reg_failure(reg))
return;

RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
}

static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
{
unsigned long flags;

raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
__r8168_mac_ocp_write(tp, reg, data);
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
}

Register programming is done through RTL_W32() macro which expands into

#define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

which is further (on Alpha):

extern inline void writel(u32 b, volatile void __iomem *addr)
{
mb();
__raw_writel(b, addr);
}

or on i386/x86_64:

#define build_mmio_write(name, size, type, reg, barrier) \
static inline void name(type val, volatile void __iomem *addr) \
{ asm volatile("mov" size " %0,%1": :reg (val), \
"m" (*(volatile type __force *)addr) barrier); }

build_mmio_write(writel, "l", unsigned int, "r", :"memory")

This obviously involves iat least a compiler barrier.

mb() expands into something like this i.e. on x86_64:

#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
memory barrier, writel(), and spin_unlock_irqrestore().

With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
for certain time that locked memory read-modify-write cyclo or I/O takes to finish.

In a sequential case of RTL register programming, the writes to RTL registers
can be coalesced under a same raw spinlock. This can dramatically decrease the
number of bus stalls in a multicore or multi-CPU system:

static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
const struct recover_8411b_info *array)
{
struct recover_8411b_info const *p = array;

while (p->reg) {
if (!rtl_ocp_reg_failure(p->reg))
RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
p++;
}
}

static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
const struct recover_8411b_info *array)
{
unsigned long flags;

raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
__r8168_mac_ocp_write_seq(tp, array);
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
}

static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
{

...

/* The following Realtek-provided magic fixes an issue with the RX unit
* getting confused after the PHY having been powered-down.
*/

static const struct recover_8411b_info init_zero_seq[] = {
{ 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
...
};

static const struct recover_8411b_info recover_seq[] = {
{ 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
...
};

static const struct recover_8411b_info final_seq[] = {
{ 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
...
};

r8168_mac_ocp_write_seq(tp, init_zero_seq);
mdelay(3);
r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
r8168_mac_ocp_write_seq(tp, recover_seq);
r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
r8168_mac_ocp_write_seq(tp, final_seq);
}

The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
functions that only changed the function names and the ending of the line, so the actual
hex data is unchanged.

Note that the original reason for the introduction of the commit fe4e8db0392a6
was to enable recovery of the RX unit on the RTL8411b which was confused by the
powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
into a series of about 500+ memory bus locks, most waiting for the main memory read,
modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
the programming sequence to reach RTL NIC registers.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075

Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
Cc: Heiner Kallweit <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
---
v3:
removed register/mask pair array sentinels, so using ARRAY_SIZE().
avoided duplication of RTL_W32() call code as advised by Heiner.

drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
1 file changed, 72 insertions(+), 126 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 361b90007148..3b28bec7098b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
}

+struct e_info_regmask_pair {
+ u32 reg;
+ u32 data;
+};
+
+static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
+ const struct e_info_regmask_pair *array, int len)
+{
+ struct e_info_regmask_pair const *p;
+
+ for (p = array; len--; p++)
+ __r8168_mac_ocp_write(tp, p->reg, p->data);
+}
+
+static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
+ const struct e_info_regmask_pair *array, int len)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+ __r8168_mac_ocp_write_seq(tp, array, len);
+ raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
+}
+
+#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
+
/* Work around a hw issue with RTL8168g PHY, the quirk disables
* PHY MCU interrupts before PHY power-down.
*/
@@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
/* The following Realtek-provided magic fixes an issue with the RX unit
* getting confused after the PHY having been powered-down.
*/
- r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
+
+ static const struct e_info_regmask_pair init_zero_seq[] = {
+ { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
+ { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
+ };
+
+ static const struct e_info_regmask_pair recover_seq[] = {
+ { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
+ { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
+ { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
+ { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
+ { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
+ { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
+ { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
+ { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
+ { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
+ { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
+ { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
+ { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
+ { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
+ { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
+ { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
+ { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
+ { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
+ { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
+ { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
+ { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
+ { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
+ { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
+ { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
+ { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
+ { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
+ { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
+ { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
+ { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
+ };
+
+ static const struct e_info_regmask_pair final_seq[] = {
+ { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
+ { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
+ };
+
+ r8168_mac_ocp_write_seq(tp, init_zero_seq);
mdelay(3);
r8168_mac_ocp_write(tp, 0xFC26, 0x0000);

- r8168_mac_ocp_write(tp, 0xF800, 0xE008);
- r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
- r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
- r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
- r8168_mac_ocp_write(tp, 0xF808, 0xE027);
- r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
- r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
- r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
- r8168_mac_ocp_write(tp, 0xF810, 0xC602);
- r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF814, 0x0000);
- r8168_mac_ocp_write(tp, 0xF816, 0xC502);
- r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
- r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
- r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
- r8168_mac_ocp_write(tp, 0xF820, 0x080A);
- r8168_mac_ocp_write(tp, 0xF822, 0x6420);
- r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
- r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
- r8168_mac_ocp_write(tp, 0xF828, 0xC516);
- r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
- r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
- r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
- r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
- r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
- r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
- r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
- r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
- r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
- r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
- r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
- r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
- r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
- r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
- r8168_mac_ocp_write(tp, 0xF846, 0xC404);
- r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
- r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
- r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
- r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
- r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
- r8168_mac_ocp_write(tp, 0xF852, 0xE434);
- r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
- r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
- r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
- r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
- r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
- r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
- r8168_mac_ocp_write(tp, 0xF860, 0xF007);
- r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
- r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
- r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
- r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
- r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
- r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
- r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
- r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
- r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
- r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
- r8168_mac_ocp_write(tp, 0xF876, 0xC516);
- r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
- r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
- r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
- r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
- r8168_mac_ocp_write(tp, 0xF880, 0xC512);
- r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
- r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
- r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
- r8168_mac_ocp_write(tp, 0xF888, 0x483F);
- r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
- r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
- r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
- r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
- r8168_mac_ocp_write(tp, 0xF892, 0xC505);
- r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF896, 0xC502);
- r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
- r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
- r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
- r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
- r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
- r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
- r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
- r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
- r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
- r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
- r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
- r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
- r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
- r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
- r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
- r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
- r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
- r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
- r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
- r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
- r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
- r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
- r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
- r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
- r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
- r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
- r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
- r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
- r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
+ r8168_mac_ocp_write_seq(tp, recover_seq);

r8168_mac_ocp_write(tp, 0xFC26, 0x8000);

- r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
- r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
- r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
- r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
- r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
- r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
- r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
+ r8168_mac_ocp_write_seq(tp, final_seq);
+
}

static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
--
2.34.1


2023-10-28 19:21:57

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
> On RTL8411b the RX unit gets confused if the PHY is powered-down.
> This was reported in [0] and confirmed by Realtek. Realtek provided
> a sequence to fix the RX unit after PHY wakeup.
>
> A series of about 130 r8168_mac_ocp_write() calls is performed to
> program the RTL registers for recovery.
>
> r8168_mac_ocp_write() expands to this code:
>
> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
> {
> if (rtl_ocp_reg_failure(reg))
> return;
>
> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
> }
>
> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
> {
> unsigned long flags;
>
> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> __r8168_mac_ocp_write(tp, reg, data);
> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> }
>
> Register programming is done through RTL_W32() macro which expands into
>
> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>
> which is further (on Alpha):
>
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> mb();
> __raw_writel(b, addr);
> }
>
> or on i386/x86_64:
>
> #define build_mmio_write(name, size, type, reg, barrier) \
> static inline void name(type val, volatile void __iomem *addr) \
> { asm volatile("mov" size " %0,%1": :reg (val), \
> "m" (*(volatile type __force *)addr) barrier); }
>
> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>
> This obviously involves iat least a compiler barrier.
>
> mb() expands into something like this i.e. on x86_64:
>
> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>
> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
> memory barrier, writel(), and spin_unlock_irqrestore().
>
> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>
> In a sequential case of RTL register programming, the writes to RTL registers
> can be coalesced under a same raw spinlock. This can dramatically decrease the
> number of bus stalls in a multicore or multi-CPU system:
>
> static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> const struct recover_8411b_info *array)
> {
> struct recover_8411b_info const *p = array;
>
> while (p->reg) {
> if (!rtl_ocp_reg_failure(p->reg))
> RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
> p++;
> }
> }
>
> static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> const struct recover_8411b_info *array)
> {
> unsigned long flags;
>
> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> __r8168_mac_ocp_write_seq(tp, array);
> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> }
>
> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
> {
>
> ...
>
> /* The following Realtek-provided magic fixes an issue with the RX unit
> * getting confused after the PHY having been powered-down.
> */
>
> static const struct recover_8411b_info init_zero_seq[] = {
> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
> ...
> };
>
> static const struct recover_8411b_info recover_seq[] = {
> { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
> ...
> };
>
> static const struct recover_8411b_info final_seq[] = {
> { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
> ...
> };
>
> r8168_mac_ocp_write_seq(tp, init_zero_seq);
> mdelay(3);
> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
> r8168_mac_ocp_write_seq(tp, recover_seq);
> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
> r8168_mac_ocp_write_seq(tp, final_seq);
> }
>
> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
> functions that only changed the function names and the ending of the line, so the actual
> hex data is unchanged.
>
> Note that the original reason for the introduction of the commit fe4e8db0392a6
> was to enable recovery of the RX unit on the RTL8411b which was confused by the
> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem

I still have a problem with this statement as you're saying that the original
problem still exists. I don't think that's the case.

> into a series of about 500+ memory bus locks, most waiting for the main memory read,
> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
> the programming sequence to reach RTL NIC registers.
>
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>
> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
> Cc: Heiner Kallweit <[email protected]>
> Cc: Marco Elver <[email protected]>
> Cc: [email protected]
> Cc: "David S. Miller" <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Jakub Kicinski <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
> ---
> v3:
> removed register/mask pair array sentinels, so using ARRAY_SIZE().
> avoided duplication of RTL_W32() call code as advised by Heiner.
>
> drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
> 1 file changed, 72 insertions(+), 126 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 361b90007148..3b28bec7098b 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> }
>
> +struct e_info_regmask_pair {
> + u32 reg;
> + u32 data;
> +};
> +
> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> + const struct e_info_regmask_pair *array, int len)
> +{
> + struct e_info_regmask_pair const *p;
> +
> + for (p = array; len--; p++)
> + __r8168_mac_ocp_write(tp, p->reg, p->data);
> +}
> +
> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> + const struct e_info_regmask_pair *array, int len)
> +{
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> + __r8168_mac_ocp_write_seq(tp, array, len);
> + raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> +}
> +
> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
> +
> /* Work around a hw issue with RTL8168g PHY, the quirk disables
> * PHY MCU interrupts before PHY power-down.
> */
> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
> /* The following Realtek-provided magic fixes an issue with the RX unit
> * getting confused after the PHY having been powered-down.
> */
> - r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
> - r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
> +
> + static const struct e_info_regmask_pair init_zero_seq[] = {
> + { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
> + { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
> + };
> +
Don't mix code and variable declarations. Did you run checkpatch?
I think it would complain here.

> + static const struct e_info_regmask_pair recover_seq[] = {
> + { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
> + { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
> + { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
> + { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
> + { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
> + { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
> + { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
> + { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
> + { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
> + { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
> + { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
> + { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
> + { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
> + { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
> + { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
> + { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
> + { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
> + { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
> + { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
> + { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
> + { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
> + { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
> + { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
> + { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
> + { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
> + { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
> + { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
> + { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
> + };
> +
> + static const struct e_info_regmask_pair final_seq[] = {
> + { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
> + { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
> + };
> +
> + r8168_mac_ocp_write_seq(tp, init_zero_seq);
> mdelay(3);
> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>
> - r8168_mac_ocp_write(tp, 0xF800, 0xE008);
> - r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
> - r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
> - r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
> - r8168_mac_ocp_write(tp, 0xF808, 0xE027);
> - r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
> - r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
> - r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
> - r8168_mac_ocp_write(tp, 0xF810, 0xC602);
> - r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
> - r8168_mac_ocp_write(tp, 0xF814, 0x0000);
> - r8168_mac_ocp_write(tp, 0xF816, 0xC502);
> - r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
> - r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
> - r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
> - r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
> - r8168_mac_ocp_write(tp, 0xF820, 0x080A);
> - r8168_mac_ocp_write(tp, 0xF822, 0x6420);
> - r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
> - r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
> - r8168_mac_ocp_write(tp, 0xF828, 0xC516);
> - r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
> - r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
> - r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
> - r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
> - r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
> - r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
> - r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
> - r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
> - r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
> - r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
> - r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
> - r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
> - r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
> - r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
> - r8168_mac_ocp_write(tp, 0xF846, 0xC404);
> - r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
> - r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
> - r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
> - r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
> - r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
> - r8168_mac_ocp_write(tp, 0xF852, 0xE434);
> - r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
> - r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
> - r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
> - r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
> - r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
> - r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
> - r8168_mac_ocp_write(tp, 0xF860, 0xF007);
> - r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
> - r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
> - r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
> - r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
> - r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
> - r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
> - r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
> - r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
> - r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
> - r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
> - r8168_mac_ocp_write(tp, 0xF876, 0xC516);
> - r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
> - r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
> - r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
> - r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
> - r8168_mac_ocp_write(tp, 0xF880, 0xC512);
> - r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
> - r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
> - r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
> - r8168_mac_ocp_write(tp, 0xF888, 0x483F);
> - r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
> - r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
> - r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
> - r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
> - r8168_mac_ocp_write(tp, 0xF892, 0xC505);
> - r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
> - r8168_mac_ocp_write(tp, 0xF896, 0xC502);
> - r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
> - r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
> - r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
> - r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
> - r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
> - r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
> - r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
> - r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
> - r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
> - r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
> - r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
> - r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
> - r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
> - r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
> - r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
> - r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
> - r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
> - r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
> - r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
> - r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
> - r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
> - r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
> - r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
> - r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
> - r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
> - r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
> - r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
> - r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
> - r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
> - r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
> - r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
> - r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
> - r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
> - r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
> - r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
> + r8168_mac_ocp_write_seq(tp, recover_seq);
>
> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>
> - r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
> - r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
> - r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
> - r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
> - r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
> - r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
> - r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
> + r8168_mac_ocp_write_seq(tp, final_seq);
> +
> }
>
> static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-29 04:56:59

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls



On 10/28/23 21:21, Heiner Kallweit wrote:
> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>> This was reported in [0] and confirmed by Realtek. Realtek provided
>> a sequence to fix the RX unit after PHY wakeup.
>>
>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>> program the RTL registers for recovery.
>>
>> r8168_mac_ocp_write() expands to this code:
>>
>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>> {
>> if (rtl_ocp_reg_failure(reg))
>> return;
>>
>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>> }
>>
>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>> {
>> unsigned long flags;
>>
>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> __r8168_mac_ocp_write(tp, reg, data);
>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> }
>>
>> Register programming is done through RTL_W32() macro which expands into
>>
>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>
>> which is further (on Alpha):
>>
>> extern inline void writel(u32 b, volatile void __iomem *addr)
>> {
>> mb();
>> __raw_writel(b, addr);
>> }
>>
>> or on i386/x86_64:
>>
>> #define build_mmio_write(name, size, type, reg, barrier) \
>> static inline void name(type val, volatile void __iomem *addr) \
>> { asm volatile("mov" size " %0,%1": :reg (val), \
>> "m" (*(volatile type __force *)addr) barrier); }
>>
>> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>
>> This obviously involves iat least a compiler barrier.
>>
>> mb() expands into something like this i.e. on x86_64:
>>
>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>
>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>> memory barrier, writel(), and spin_unlock_irqrestore().
>>
>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>
>> In a sequential case of RTL register programming, the writes to RTL registers
>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>> number of bus stalls in a multicore or multi-CPU system:
>>
>> static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>> const struct recover_8411b_info *array)
>> {
>> struct recover_8411b_info const *p = array;
>>
>> while (p->reg) {
>> if (!rtl_ocp_reg_failure(p->reg))
>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>> p++;
>> }
>> }
>>
>> static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>> const struct recover_8411b_info *array)
>> {
>> unsigned long flags;
>>
>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> __r8168_mac_ocp_write_seq(tp, array);
>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> }
>>
>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>> {
>>
>> ...
>>
>> /* The following Realtek-provided magic fixes an issue with the RX unit
>> * getting confused after the PHY having been powered-down.
>> */
>>
>> static const struct recover_8411b_info init_zero_seq[] = {
>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>> ...
>> };
>>
>> static const struct recover_8411b_info recover_seq[] = {
>> { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>> ...
>> };
>>
>> static const struct recover_8411b_info final_seq[] = {
>> { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>> ...
>> };
>>
>> r8168_mac_ocp_write_seq(tp, init_zero_seq);
>> mdelay(3);
>> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>> r8168_mac_ocp_write_seq(tp, recover_seq);
>> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>> r8168_mac_ocp_write_seq(tp, final_seq);
>> }
>>
>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>> functions that only changed the function names and the ending of the line, so the actual
>> hex data is unchanged.
>>
>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>
> I still have a problem with this statement as you're saying that the original
> problem still exists. I don't think that's the case.

I will not disagree about it.

But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
pairs.

Maybe additionally, on the low level, memory barrier isn't required for each write to
MMIO?

If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
authoritative ...

>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>> the programming sequence to reach RTL NIC registers.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>
>> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
>> Cc: Heiner Kallweit <[email protected]>
>> Cc: Marco Elver <[email protected]>
>> Cc: [email protected]
>> Cc: "David S. Miller" <[email protected]>
>> Cc: Eric Dumazet <[email protected]>
>> Cc: Jakub Kicinski <[email protected]>
>> Cc: Paolo Abeni <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>> ---
>> v3:
>> removed register/mask pair array sentinels, so using ARRAY_SIZE().
>> avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>> drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
>> 1 file changed, 72 insertions(+), 126 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index 361b90007148..3b28bec7098b 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> }
>>
>> +struct e_info_regmask_pair {
>> + u32 reg;
>> + u32 data;
>> +};
>> +
>> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>> + const struct e_info_regmask_pair *array, int len)
>> +{
>> + struct e_info_regmask_pair const *p;
>> +
>> + for (p = array; len--; p++)
>> + __r8168_mac_ocp_write(tp, p->reg, p->data);
>> +}
>> +
>> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>> + const struct e_info_regmask_pair *array, int len)
>> +{
>> + unsigned long flags;
>> +
>> + raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> + __r8168_mac_ocp_write_seq(tp, array, len);
>> + raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> +}
>> +
>> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
>> +
>> /* Work around a hw issue with RTL8168g PHY, the quirk disables
>> * PHY MCU interrupts before PHY power-down.
>> */
>> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>> /* The following Realtek-provided magic fixes an issue with the RX unit
>> * getting confused after the PHY having been powered-down.
>> */
>> - r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
>> +
>> + static const struct e_info_regmask_pair init_zero_seq[] = {
>> + { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>> + { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
>> + };
>> +
> Don't mix code and variable declarations. Did you run checkpatch?
> I think it would complain here.

Thank you for the warning, I will fix it.

As I said to Mr. Greg, I will do the required number of iterations to fix this issue.

I will add checkpatch to my routine handling of my submitted patches.

Thanks,
Mirsad

>> + static const struct e_info_regmask_pair recover_seq[] = {
>> + { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>> + { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
>> + { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
>> + { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
>> + { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
>> + { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
>> + { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
>> + { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
>> + { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
>> + { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
>> + { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
>> + { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
>> + { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
>> + { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
>> + { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
>> + { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
>> + { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
>> + { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
>> + { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
>> + { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
>> + { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
>> + { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
>> + { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
>> + { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
>> + { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
>> + { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
>> + { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
>> + { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
>> + };
>> +
>> + static const struct e_info_regmask_pair final_seq[] = {
>> + { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>> + { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
>> + };
>> +
>> + r8168_mac_ocp_write_seq(tp, init_zero_seq);
>> mdelay(3);
>> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>
>> - r8168_mac_ocp_write(tp, 0xF800, 0xE008);
>> - r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
>> - r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
>> - r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
>> - r8168_mac_ocp_write(tp, 0xF808, 0xE027);
>> - r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
>> - r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
>> - r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
>> - r8168_mac_ocp_write(tp, 0xF810, 0xC602);
>> - r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
>> - r8168_mac_ocp_write(tp, 0xF814, 0x0000);
>> - r8168_mac_ocp_write(tp, 0xF816, 0xC502);
>> - r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
>> - r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
>> - r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
>> - r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
>> - r8168_mac_ocp_write(tp, 0xF820, 0x080A);
>> - r8168_mac_ocp_write(tp, 0xF822, 0x6420);
>> - r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
>> - r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
>> - r8168_mac_ocp_write(tp, 0xF828, 0xC516);
>> - r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
>> - r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
>> - r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
>> - r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
>> - r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
>> - r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
>> - r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
>> - r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
>> - r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
>> - r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
>> - r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
>> - r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
>> - r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
>> - r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
>> - r8168_mac_ocp_write(tp, 0xF846, 0xC404);
>> - r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
>> - r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
>> - r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
>> - r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
>> - r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
>> - r8168_mac_ocp_write(tp, 0xF852, 0xE434);
>> - r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
>> - r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
>> - r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
>> - r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
>> - r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
>> - r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
>> - r8168_mac_ocp_write(tp, 0xF860, 0xF007);
>> - r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
>> - r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
>> - r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
>> - r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
>> - r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
>> - r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
>> - r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
>> - r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
>> - r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
>> - r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
>> - r8168_mac_ocp_write(tp, 0xF876, 0xC516);
>> - r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
>> - r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
>> - r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
>> - r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
>> - r8168_mac_ocp_write(tp, 0xF880, 0xC512);
>> - r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
>> - r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
>> - r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
>> - r8168_mac_ocp_write(tp, 0xF888, 0x483F);
>> - r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
>> - r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
>> - r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
>> - r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
>> - r8168_mac_ocp_write(tp, 0xF892, 0xC505);
>> - r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
>> - r8168_mac_ocp_write(tp, 0xF896, 0xC502);
>> - r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
>> - r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
>> - r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
>> - r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
>> - r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
>> - r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
>> - r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
>> - r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
>> - r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
>> - r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
>> - r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
>> - r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
>> - r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
>> - r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
>> - r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
>> - r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
>> - r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
>> - r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
>> - r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
>> - r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
>> - r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
>> - r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
>> - r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
>> - r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
>> - r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
>> - r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
>> - r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
>> - r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
>> - r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
>> - r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
>> - r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
>> - r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
>> - r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
>> - r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
>> - r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
>> + r8168_mac_ocp_write_seq(tp, recover_seq);
>>
>> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>
>> - r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
>> - r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
>> - r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
>> - r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
>> - r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
>> - r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
>> - r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
>> + r8168_mac_ocp_write_seq(tp, final_seq);
>> +
>> }
>>
>> static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-30 13:18:15

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 29.10.2023 05:56, Mirsad Todorovac wrote:
>
>
> On 10/28/23 21:21, Heiner Kallweit wrote:
>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>> This was reported in [0] and confirmed by Realtek. Realtek provided
>>> a sequence to fix the RX unit after PHY wakeup.
>>>
>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>>> program the RTL registers for recovery.
>>>
>>> r8168_mac_ocp_write() expands to this code:
>>>
>>>          static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>          {
>>>                  if (rtl_ocp_reg_failure(reg))
>>>                          return;
>>>
>>>                  RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>          }
>>>
>>>          static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>          {
>>>                  unsigned long flags;
>>>
>>>                  raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>                  __r8168_mac_ocp_write(tp, reg, data);
>>>                  raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>          }
>>>
>>> Register programming is done through RTL_W32() macro which expands into
>>>
>>>          #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>
>>> which is further (on Alpha):
>>>
>>>          extern inline void writel(u32 b, volatile void __iomem *addr)
>>>          {
>>>                  mb();
>>>                  __raw_writel(b, addr);
>>>          }
>>>
>>> or on i386/x86_64:
>>>
>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>
>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>
>>> This obviously involves iat least a compiler barrier.
>>>
>>> mb() expands into something like this i.e. on x86_64:
>>>
>>>          #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>
>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>
>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>>
>>> In a sequential case of RTL register programming, the writes to RTL registers
>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>> number of bus stalls in a multicore or multi-CPU system:
>>>
>>>          static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>                                                const struct recover_8411b_info *array)
>>>          {
>>>                  struct recover_8411b_info const *p = array;
>>>
>>>                  while (p->reg) {
>>>                          if (!rtl_ocp_reg_failure(p->reg))
>>>                                  RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>>>                          p++;
>>>                  }
>>>          }
>>>
>>>          static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>                                              const struct recover_8411b_info *array)
>>>          {
>>>                  unsigned long flags;
>>>
>>>                  raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>                  __r8168_mac_ocp_write_seq(tp, array);
>>>                  raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>          }
>>>
>>>          static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>          {
>>>
>>>                  ...
>>>
>>>                  /* The following Realtek-provided magic fixes an issue with the RX unit
>>>                   * getting confused after the PHY having been powered-down.
>>>                   */
>>>
>>>                  static const struct recover_8411b_info init_zero_seq[] = {
>>>                          { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>             ...
>>>                  };
>>>
>>>                  static const struct recover_8411b_info recover_seq[] = {
>>>                          { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>             ...
>>>                  };
>>>
>>>                  static const struct recover_8411b_info final_seq[] = {
>>>                          { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>             ...
>>>                  };
>>>
>>>                  r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>                  mdelay(3);
>>>                  r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>                  r8168_mac_ocp_write_seq(tp, recover_seq);
>>>                  r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>                  r8168_mac_ocp_write_seq(tp, final_seq);
>>>          }
>>>
>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>> functions that only changed the function names and the ending of the line, so the actual
>>> hex data is unchanged.
>>>
>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>
>> I still have a problem with this statement as you're saying that the original
>> problem still exists. I don't think that's the case.
>
> I will not disagree about it.
>
> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
> pairs.
>
> Maybe additionally, on the low level, memory barrier isn't required for each write to
> MMIO?
>
One could argue whether in several places writel_relaxed() could be used.
But it's not really worth it, because we're not in a hot path.

> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
> authoritative ...
>
>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>> the programming sequence to reach RTL NIC registers.
>>>
>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>
>>> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
>>> Cc: Heiner Kallweit <[email protected]>
>>> Cc: Marco Elver <[email protected]>
>>> Cc: [email protected]
>>> Cc: "David S. Miller" <[email protected]>
>>> Cc: Eric Dumazet <[email protected]>
>>> Cc: Jakub Kicinski <[email protected]>
>>> Cc: Paolo Abeni <[email protected]>
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>>> ---
>>> v3:
>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
>>>   1 file changed, 72 insertions(+), 126 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 361b90007148..3b28bec7098b 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
>>>       raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>   }
>>>   +struct e_info_regmask_pair {
>>> +    u32    reg;
>>> +    u32    data;
>>> +};
>>> +
>>> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>> +                      const struct e_info_regmask_pair *array, int len)
>>> +{
>>> +    struct e_info_regmask_pair const *p;
>>> +
>>> +    for (p = array; len--; p++)
>>> +        __r8168_mac_ocp_write(tp, p->reg, p->data);
>>> +}
>>> +
>>> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>> +                     const struct e_info_regmask_pair *array, int len)
>>> +{
>>> +    unsigned long flags;
>>> +
>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>> +    __r8168_mac_ocp_write_seq(tp, array, len);
>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>> +}
>>> +
>>> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
>>> +
>>>   /* Work around a hw issue with RTL8168g PHY, the quirk disables
>>>    * PHY MCU interrupts before PHY power-down.
>>>    */
>>> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>       /* The following Realtek-provided magic fixes an issue with the RX unit
>>>        * getting confused after the PHY having been powered-down.
>>>        */
>>> -    r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
>>> +
>>> +    static const struct e_info_regmask_pair init_zero_seq[] = {
>>> +        { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>> +        { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
>>> +    };
>>> +
>> Don't mix code and variable declarations. Did you run checkpatch?
>> I think it would complain here.
>
> Thank you for the warning, I will fix it.
>
> As I said to Mr. Greg, I will do the required number of iterations to fix this issue.
>
> I will add checkpatch to my routine handling of my submitted patches.
>
> Thanks,
> Mirsad
>
>>> +    static const struct e_info_regmask_pair recover_seq[] = {
>>> +        { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>> +        { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
>>> +        { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
>>> +        { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
>>> +        { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
>>> +        { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
>>> +        { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
>>> +        { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
>>> +        { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
>>> +        { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
>>> +        { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
>>> +        { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
>>> +        { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
>>> +        { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
>>> +        { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
>>> +        { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
>>> +        { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
>>> +        { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
>>> +        { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
>>> +        { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
>>> +        { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
>>> +        { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
>>> +        { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
>>> +        { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
>>> +        { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
>>> +        { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
>>> +        { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
>>> +        { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
>>> +    };
>>> +
>>> +    static const struct e_info_regmask_pair final_seq[] = {
>>> +        { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>> +        { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
>>> +    };
>>> +
>>> +    r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>       mdelay(3);
>>>       r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>   -    r8168_mac_ocp_write(tp, 0xF800, 0xE008);
>>> -    r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
>>> -    r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
>>> -    r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
>>> -    r8168_mac_ocp_write(tp, 0xF808, 0xE027);
>>> -    r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
>>> -    r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
>>> -    r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
>>> -    r8168_mac_ocp_write(tp, 0xF810, 0xC602);
>>> -    r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
>>> -    r8168_mac_ocp_write(tp, 0xF814, 0x0000);
>>> -    r8168_mac_ocp_write(tp, 0xF816, 0xC502);
>>> -    r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
>>> -    r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
>>> -    r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
>>> -    r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
>>> -    r8168_mac_ocp_write(tp, 0xF820, 0x080A);
>>> -    r8168_mac_ocp_write(tp, 0xF822, 0x6420);
>>> -    r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
>>> -    r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
>>> -    r8168_mac_ocp_write(tp, 0xF828, 0xC516);
>>> -    r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
>>> -    r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
>>> -    r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
>>> -    r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
>>> -    r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
>>> -    r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
>>> -    r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
>>> -    r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
>>> -    r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
>>> -    r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
>>> -    r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
>>> -    r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
>>> -    r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
>>> -    r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
>>> -    r8168_mac_ocp_write(tp, 0xF846, 0xC404);
>>> -    r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
>>> -    r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
>>> -    r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
>>> -    r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
>>> -    r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
>>> -    r8168_mac_ocp_write(tp, 0xF852, 0xE434);
>>> -    r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
>>> -    r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
>>> -    r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
>>> -    r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
>>> -    r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
>>> -    r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
>>> -    r8168_mac_ocp_write(tp, 0xF860, 0xF007);
>>> -    r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
>>> -    r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
>>> -    r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
>>> -    r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
>>> -    r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
>>> -    r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
>>> -    r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
>>> -    r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
>>> -    r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
>>> -    r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
>>> -    r8168_mac_ocp_write(tp, 0xF876, 0xC516);
>>> -    r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
>>> -    r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
>>> -    r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
>>> -    r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
>>> -    r8168_mac_ocp_write(tp, 0xF880, 0xC512);
>>> -    r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
>>> -    r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
>>> -    r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
>>> -    r8168_mac_ocp_write(tp, 0xF888, 0x483F);
>>> -    r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
>>> -    r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
>>> -    r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
>>> -    r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
>>> -    r8168_mac_ocp_write(tp, 0xF892, 0xC505);
>>> -    r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
>>> -    r8168_mac_ocp_write(tp, 0xF896, 0xC502);
>>> -    r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
>>> -    r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
>>> -    r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
>>> -    r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
>>> -    r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
>>> -    r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
>>> -    r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
>>> -    r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
>>> -    r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
>>> -    r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
>>> -    r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
>>> -    r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
>>> -    r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
>>> -    r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
>>> -    r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
>>> -    r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
>>> -    r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
>>> -    r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
>>> -    r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
>>> -    r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
>>> -    r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
>>> -    r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
>>> -    r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
>>> -    r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
>>> -    r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
>>> -    r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
>>> -    r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
>>> -    r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
>>> -    r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
>>> -    r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
>>> -    r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
>>> -    r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
>>> -    r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
>>> -    r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
>>> -    r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
>>> +    r8168_mac_ocp_write_seq(tp, recover_seq);
>>>         r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>   -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
>>> +    r8168_mac_ocp_write_seq(tp, final_seq);
>>> +
>>>   }
>>>     static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-30 13:31:18

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls



On 10/30/23 14:17, Heiner Kallweit wrote:
> On 29.10.2023 05:56, Mirsad Todorovac wrote:
>>
>>
>> On 10/28/23 21:21, Heiner Kallweit wrote:
>>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>> This was reported in [0] and confirmed by Realtek. Realtek provided
>>>> a sequence to fix the RX unit after PHY wakeup.
>>>>
>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>>>> program the RTL registers for recovery.
>>>>
>>>> r8168_mac_ocp_write() expands to this code:
>>>>
>>>>          static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>          {
>>>>                  if (rtl_ocp_reg_failure(reg))
>>>>                          return;
>>>>
>>>>                  RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>          }
>>>>
>>>>          static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>          {
>>>>                  unsigned long flags;
>>>>
>>>>                  raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>                  __r8168_mac_ocp_write(tp, reg, data);
>>>>                  raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>          }
>>>>
>>>> Register programming is done through RTL_W32() macro which expands into
>>>>
>>>>          #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>
>>>> which is further (on Alpha):
>>>>
>>>>          extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>          {
>>>>                  mb();
>>>>                  __raw_writel(b, addr);
>>>>          }
>>>>
>>>> or on i386/x86_64:
>>>>
>>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>>
>>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>
>>>> This obviously involves iat least a compiler barrier.
>>>>
>>>> mb() expands into something like this i.e. on x86_64:
>>>>
>>>>          #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>
>>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>
>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>>>
>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>> number of bus stalls in a multicore or multi-CPU system:
>>>>
>>>>          static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>                                                const struct recover_8411b_info *array)
>>>>          {
>>>>                  struct recover_8411b_info const *p = array;
>>>>
>>>>                  while (p->reg) {
>>>>                          if (!rtl_ocp_reg_failure(p->reg))
>>>>                                  RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>>>>                          p++;
>>>>                  }
>>>>          }
>>>>
>>>>          static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>                                              const struct recover_8411b_info *array)
>>>>          {
>>>>                  unsigned long flags;
>>>>
>>>>                  raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>                  __r8168_mac_ocp_write_seq(tp, array);
>>>>                  raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>          }
>>>>
>>>>          static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>          {
>>>>
>>>>                  ...
>>>>
>>>>                  /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>                   * getting confused after the PHY having been powered-down.
>>>>                   */
>>>>
>>>>                  static const struct recover_8411b_info init_zero_seq[] = {
>>>>                          { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>             ...
>>>>                  };
>>>>
>>>>                  static const struct recover_8411b_info recover_seq[] = {
>>>>                          { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>             ...
>>>>                  };
>>>>
>>>>                  static const struct recover_8411b_info final_seq[] = {
>>>>                          { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>             ...
>>>>                  };
>>>>
>>>>                  r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>                  mdelay(3);
>>>>                  r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>                  r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>                  r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>                  r8168_mac_ocp_write_seq(tp, final_seq);
>>>>          }
>>>>
>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>> functions that only changed the function names and the ending of the line, so the actual
>>>> hex data is unchanged.
>>>>
>>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>
>>> I still have a problem with this statement as you're saying that the original
>>> problem still exists. I don't think that's the case.
>>
>> I will not disagree about it.
>>
>> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
>> pairs.
>>
>> Maybe additionally, on the low level, memory barrier isn't required for each write to
>> MMIO?
>>
> One could argue whether in several places writel_relaxed() could be used.
> But it's not really worth it, because we're not in a hot path.

I see. Thank you for your evaluation.

Using writel_relaxed() sounds clever. It expands to:

#define build_mmio_write(name, size, type, reg, barrier) \
static inline void name(type val, volatile void __iomem *addr) \
{ asm volatile("mov" size " %0,%1": :reg (val), \
"m" (*(volatile type __force *)addr) barrier); }
build_mmio_write(__writel, "l", unsigned int, "r", )
#define writel_relaxed(v, a) __writel(v, a)

Here "barrier" is an empty string. Really clever. ;-)

I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
in each single one of the drivers could amount to some degrading of overall performance and
latency in a multicore system.

As I understood Mr. Jonathan Corbet on LWN, the initiative and trend is to reduce overall
kernel latency.

Thanks,
Mirsad

>> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
>> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
>> authoritative ...
>>
>>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>>> the programming sequence to reach RTL NIC registers.
>>>>
>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>>
>>>> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
>>>> Cc: Heiner Kallweit <[email protected]>
>>>> Cc: Marco Elver <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: "David S. Miller" <[email protected]>
>>>> Cc: Eric Dumazet <[email protected]>
>>>> Cc: Jakub Kicinski <[email protected]>
>>>> Cc: Paolo Abeni <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>>>> ---
>>>> v3:
>>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>>
>>>>   drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
>>>>   1 file changed, 72 insertions(+), 126 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>>> index 361b90007148..3b28bec7098b 100644
>>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>>> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
>>>>       raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>   }
>>>>   +struct e_info_regmask_pair {
>>>> +    u32    reg;
>>>> +    u32    data;
>>>> +};
>>>> +
>>>> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>> +                      const struct e_info_regmask_pair *array, int len)
>>>> +{
>>>> +    struct e_info_regmask_pair const *p;
>>>> +
>>>> +    for (p = array; len--; p++)
>>>> +        __r8168_mac_ocp_write(tp, p->reg, p->data);
>>>> +}
>>>> +
>>>> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>> +                     const struct e_info_regmask_pair *array, int len)
>>>> +{
>>>> +    unsigned long flags;
>>>> +
>>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>> +    __r8168_mac_ocp_write_seq(tp, array, len);
>>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>> +}
>>>> +
>>>> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
>>>> +
>>>>   /* Work around a hw issue with RTL8168g PHY, the quirk disables
>>>>    * PHY MCU interrupts before PHY power-down.
>>>>    */
>>>> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>       /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>        * getting confused after the PHY having been powered-down.
>>>>        */
>>>> -    r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
>>>> +
>>>> +    static const struct e_info_regmask_pair init_zero_seq[] = {
>>>> +        { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>> +        { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
>>>> +    };
>>>> +
>>> Don't mix code and variable declarations. Did you run checkpatch?
>>> I think it would complain here.
>>
>> Thank you for the warning, I will fix it.
>>
>> As I said to Mr. Greg, I will do the required number of iterations to fix this issue.
>>
>> I will add checkpatch to my routine handling of my submitted patches.
>>
>> Thanks,
>> Mirsad
>>
>>>> +    static const struct e_info_regmask_pair recover_seq[] = {
>>>> +        { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>> +        { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
>>>> +        { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
>>>> +        { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
>>>> +        { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
>>>> +        { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
>>>> +        { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
>>>> +        { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
>>>> +        { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
>>>> +        { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
>>>> +        { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
>>>> +        { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
>>>> +        { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
>>>> +        { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
>>>> +        { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
>>>> +        { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
>>>> +        { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
>>>> +        { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
>>>> +        { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
>>>> +        { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
>>>> +        { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
>>>> +        { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
>>>> +        { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
>>>> +        { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
>>>> +        { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
>>>> +        { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
>>>> +        { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
>>>> +        { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
>>>> +    };
>>>> +
>>>> +    static const struct e_info_regmask_pair final_seq[] = {
>>>> +        { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>> +        { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
>>>> +    };
>>>> +
>>>> +    r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>       mdelay(3);
>>>>       r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>   -    r8168_mac_ocp_write(tp, 0xF800, 0xE008);
>>>> -    r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
>>>> -    r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
>>>> -    r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
>>>> -    r8168_mac_ocp_write(tp, 0xF808, 0xE027);
>>>> -    r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
>>>> -    r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
>>>> -    r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
>>>> -    r8168_mac_ocp_write(tp, 0xF810, 0xC602);
>>>> -    r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
>>>> -    r8168_mac_ocp_write(tp, 0xF814, 0x0000);
>>>> -    r8168_mac_ocp_write(tp, 0xF816, 0xC502);
>>>> -    r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
>>>> -    r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
>>>> -    r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
>>>> -    r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
>>>> -    r8168_mac_ocp_write(tp, 0xF820, 0x080A);
>>>> -    r8168_mac_ocp_write(tp, 0xF822, 0x6420);
>>>> -    r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
>>>> -    r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
>>>> -    r8168_mac_ocp_write(tp, 0xF828, 0xC516);
>>>> -    r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
>>>> -    r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
>>>> -    r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
>>>> -    r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
>>>> -    r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
>>>> -    r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
>>>> -    r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
>>>> -    r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
>>>> -    r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
>>>> -    r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
>>>> -    r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
>>>> -    r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
>>>> -    r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
>>>> -    r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
>>>> -    r8168_mac_ocp_write(tp, 0xF846, 0xC404);
>>>> -    r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
>>>> -    r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
>>>> -    r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
>>>> -    r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
>>>> -    r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
>>>> -    r8168_mac_ocp_write(tp, 0xF852, 0xE434);
>>>> -    r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
>>>> -    r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
>>>> -    r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
>>>> -    r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
>>>> -    r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
>>>> -    r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
>>>> -    r8168_mac_ocp_write(tp, 0xF860, 0xF007);
>>>> -    r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
>>>> -    r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
>>>> -    r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
>>>> -    r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
>>>> -    r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
>>>> -    r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
>>>> -    r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
>>>> -    r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
>>>> -    r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
>>>> -    r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
>>>> -    r8168_mac_ocp_write(tp, 0xF876, 0xC516);
>>>> -    r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
>>>> -    r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
>>>> -    r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
>>>> -    r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
>>>> -    r8168_mac_ocp_write(tp, 0xF880, 0xC512);
>>>> -    r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
>>>> -    r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
>>>> -    r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
>>>> -    r8168_mac_ocp_write(tp, 0xF888, 0x483F);
>>>> -    r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
>>>> -    r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
>>>> -    r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
>>>> -    r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
>>>> -    r8168_mac_ocp_write(tp, 0xF892, 0xC505);
>>>> -    r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
>>>> -    r8168_mac_ocp_write(tp, 0xF896, 0xC502);
>>>> -    r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
>>>> -    r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
>>>> -    r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
>>>> -    r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
>>>> -    r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
>>>> -    r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
>>>> -    r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
>>>> -    r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
>>>> -    r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
>>>> -    r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
>>>> -    r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
>>>> -    r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
>>>> -    r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
>>>> -    r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
>>>> -    r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
>>>> -    r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
>>>> -    r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
>>>> -    r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
>>>> -    r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
>>>> -    r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
>>>> -    r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
>>>> -    r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
>>>> -    r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
>>>> -    r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
>>>> -    r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
>>>> -    r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
>>>> -    r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
>>>> -    r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
>>>> -    r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
>>>> +    r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>         r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>   -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
>>>> +    r8168_mac_ocp_write_seq(tp, final_seq);
>>>> +
>>>>   }
>>>>     static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-30 13:53:37

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 10/30/23 14:30, Mirsad Todorovac wrote:
>
>
> On 10/30/23 14:17, Heiner Kallweit wrote:
>> On 29.10.2023 05:56, Mirsad Todorovac wrote:
>>>
>>>
>>> On 10/28/23 21:21, Heiner Kallweit wrote:
>>>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>>>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>>> This was reported in [0] and confirmed by Realtek. Realtek provided
>>>>> a sequence to fix the RX unit after PHY wakeup.
>>>>>
>>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>>>>> program the RTL registers for recovery.
>>>>>
>>>>> r8168_mac_ocp_write() expands to this code:
>>>>>
>>>>>           static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>           {
>>>>>                   if (rtl_ocp_reg_failure(reg))
>>>>>                           return;
>>>>>
>>>>>                   RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>>           }
>>>>>
>>>>>           static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>           {
>>>>>                   unsigned long flags;
>>>>>
>>>>>                   raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>                   __r8168_mac_ocp_write(tp, reg, data);
>>>>>                   raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>           }
>>>>>
>>>>> Register programming is done through RTL_W32() macro which expands into
>>>>>
>>>>>           #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>>
>>>>> which is further (on Alpha):
>>>>>
>>>>>           extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>>           {
>>>>>                   mb();
>>>>>                   __raw_writel(b, addr);
>>>>>           }
>>>>>
>>>>> or on i386/x86_64:
>>>>>
>>>>>       #define build_mmio_write(name, size, type, reg, barrier) \
>>>>>       static inline void name(type val, volatile void __iomem *addr) \
>>>>>       { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>>       "m" (*(volatile type __force *)addr) barrier); }
>>>>>
>>>>>       build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>>
>>>>> This obviously involves iat least a compiler barrier.
>>>>>
>>>>> mb() expands into something like this i.e. on x86_64:
>>>>>
>>>>>           #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>>
>>>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>>
>>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>>>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>>>>
>>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>>> number of bus stalls in a multicore or multi-CPU system:
>>>>>
>>>>>           static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>                                                 const struct recover_8411b_info *array)
>>>>>           {
>>>>>                   struct recover_8411b_info const *p = array;
>>>>>
>>>>>                   while (p->reg) {
>>>>>                           if (!rtl_ocp_reg_failure(p->reg))
>>>>>                                   RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>>>>>                           p++;
>>>>>                   }
>>>>>           }
>>>>>
>>>>>           static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>                                               const struct recover_8411b_info *array)
>>>>>           {
>>>>>                   unsigned long flags;
>>>>>
>>>>>                   raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>                   __r8168_mac_ocp_write_seq(tp, array);
>>>>>                   raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>           }
>>>>>
>>>>>           static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>>           {
>>>>>
>>>>>                   ...
>>>>>
>>>>>                   /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>>                    * getting confused after the PHY having been powered-down.
>>>>>                    */
>>>>>
>>>>>                   static const struct recover_8411b_info init_zero_seq[] = {
>>>>>                           { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   static const struct recover_8411b_info recover_seq[] = {
>>>>>                           { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   static const struct recover_8411b_info final_seq[] = {
>>>>>                           { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>>                   mdelay(3);
>>>>>                   r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>>                   r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>>                   r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>>                   r8168_mac_ocp_write_seq(tp, final_seq);
>>>>>           }
>>>>>
>>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>>> functions that only changed the function names and the ending of the line, so the actual
>>>>> hex data is unchanged.
>>>>>
>>>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>>
>>>> I still have a problem with this statement as you're saying that the original
>>>> problem still exists. I don't think that's the case.
>>>
>>> I will not disagree about it.
>>>
>>> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
>>> pairs.
>>>
>>> Maybe additionally, on the low level, memory barrier isn't required for each write to
>>> MMIO?
>>>
>> One could argue whether in several places writel_relaxed() could be used.
>> But it's not really worth it, because we're not in a hot path.
>
> I see. Thank you for your evaluation.
>
> Using writel_relaxed() sounds clever. It expands to:
>
>     #define build_mmio_write(name, size, type, reg, barrier) \
>         static inline void name(type val, volatile void __iomem *addr) \
>          { asm volatile("mov" size " %0,%1": :reg (val), \
>                         "m" (*(volatile type __force *)addr) barrier); }
>     build_mmio_write(__writel, "l", unsigned int, "r", )
>     #define writel_relaxed(v, a) __writel(v, a)
>
> Here "barrier" is an empty string. Really clever. ;-)
>
> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
> in each single one of the drivers could amount to some degrading of overall performance and
> latency in a multicore system.
>
> As I understood Mr. Jonathan Corbet on LWN, the initiative and trend is to reduce overall
> kernel latency.

P.S.

On the second thought, if barrier() is only the compiler optimisation barrier from memory
reordering, then we do not gain much disablin git as it doesn't affect the other cores, and
reordering MMIO writes can really confuse some NIC hardware.

/* Optimization barrier */
#ifndef barrier
/* The "volatile" is due to gcc bugs */
# define barrier() __asm__ __volatile__("": : :"memory")
#endif

>>> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
>>> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
>>> authoritative ...
>>>
>>>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>>>> the programming sequence to reach RTL NIC registers.
>>>>>
>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>>>
>>>>> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
>>>>> Cc: Heiner Kallweit <[email protected]>
>>>>> Cc: Marco Elver <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: "David S. Miller" <[email protected]>
>>>>> Cc: Eric Dumazet <[email protected]>
>>>>> Cc: Jakub Kicinski <[email protected]>
>>>>> Cc: Paolo Abeni <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: [email protected]
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>>>>> ---
>>>>> v3:
>>>>>    removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>>>    avoided duplication of RTL_W32() call code as advised by Heiner.
>>>>>
>>>>>    drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
>>>>>    1 file changed, 72 insertions(+), 126 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>>>> index 361b90007148..3b28bec7098b 100644
>>>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>>>> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
>>>>>        raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>    }
>>>>>    +struct e_info_regmask_pair {
>>>>> +    u32    reg;
>>>>> +    u32    data;
>>>>> +};
>>>>> +
>>>>> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>> +                      const struct e_info_regmask_pair *array, int len)
>>>>> +{
>>>>> +    struct e_info_regmask_pair const *p;
>>>>> +
>>>>> +    for (p = array; len--; p++)
>>>>> +        __r8168_mac_ocp_write(tp, p->reg, p->data);
>>>>> +}
>>>>> +
>>>>> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>> +                     const struct e_info_regmask_pair *array, int len)
>>>>> +{
>>>>> +    unsigned long flags;
>>>>> +
>>>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>> +    __r8168_mac_ocp_write_seq(tp, array, len);
>>>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>> +}
>>>>> +
>>>>> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
>>>>> +
>>>>>    /* Work around a hw issue with RTL8168g PHY, the quirk disables
>>>>>     * PHY MCU interrupts before PHY power-down.
>>>>>     */
>>>>> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>>        /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>>         * getting confused after the PHY having been powered-down.
>>>>>         */
>>>>> -    r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
>>>>> +
>>>>> +    static const struct e_info_regmask_pair init_zero_seq[] = {
>>>>> +        { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>> +        { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
>>>>> +    };
>>>>> +
>>>> Don't mix code and variable declarations. Did you run checkpatch?
>>>> I think it would complain here.
>>>
>>> Thank you for the warning, I will fix it.
>>>
>>> As I said to Mr. Greg, I will do the required number of iterations to fix this issue.
>>>
>>> I will add checkpatch to my routine handling of my submitted patches.
>>>
>>> Thanks,
>>> Mirsad
>>>
>>>>> +    static const struct e_info_regmask_pair recover_seq[] = {
>>>>> +        { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>> +        { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
>>>>> +        { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
>>>>> +        { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
>>>>> +        { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
>>>>> +        { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
>>>>> +        { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
>>>>> +        { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
>>>>> +        { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
>>>>> +        { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
>>>>> +        { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
>>>>> +        { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
>>>>> +        { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
>>>>> +        { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
>>>>> +        { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
>>>>> +        { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
>>>>> +        { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
>>>>> +        { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
>>>>> +        { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
>>>>> +        { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
>>>>> +        { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
>>>>> +        { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
>>>>> +        { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
>>>>> +        { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
>>>>> +        { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
>>>>> +        { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
>>>>> +        { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
>>>>> +        { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
>>>>> +    };
>>>>> +
>>>>> +    static const struct e_info_regmask_pair final_seq[] = {
>>>>> +        { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>> +        { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
>>>>> +    };
>>>>> +
>>>>> +    r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>>        mdelay(3);
>>>>>        r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>>    -    r8168_mac_ocp_write(tp, 0xF800, 0xE008);
>>>>> -    r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF808, 0xE027);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
>>>>> -    r8168_mac_ocp_write(tp, 0xF810, 0xC602);
>>>>> -    r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF814, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xF816, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF820, 0x080A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF822, 0x6420);
>>>>> -    r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
>>>>> -    r8168_mac_ocp_write(tp, 0xF828, 0xC516);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
>>>>> -    r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
>>>>> -    r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF846, 0xC404);
>>>>> -    r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF852, 0xE434);
>>>>> -    r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
>>>>> -    r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
>>>>> -    r8168_mac_ocp_write(tp, 0xF860, 0xF007);
>>>>> -    r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
>>>>> -    r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
>>>>> -    r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
>>>>> -    r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF876, 0xC516);
>>>>> -    r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF880, 0xC512);
>>>>> -    r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF888, 0x483F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
>>>>> -    r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF892, 0xC505);
>>>>> -    r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF896, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
>>>>> +    r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>>          r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>>    -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
>>>>> +    r8168_mac_ocp_write_seq(tp, final_seq);
>>>>> +
>>>>>    }
>>>>>      static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-30 14:04:29

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 30.10.2023 14:30, Mirsad Todorovac wrote:
>
>
> On 10/30/23 14:17, Heiner Kallweit wrote:
>> On 29.10.2023 05:56, Mirsad Todorovac wrote:
>>>
>>>
>>> On 10/28/23 21:21, Heiner Kallweit wrote:
>>>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>>>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>>> This was reported in [0] and confirmed by Realtek. Realtek provided
>>>>> a sequence to fix the RX unit after PHY wakeup.
>>>>>
>>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>>>>> program the RTL registers for recovery.
>>>>>
>>>>> r8168_mac_ocp_write() expands to this code:
>>>>>
>>>>>           static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>           {
>>>>>                   if (rtl_ocp_reg_failure(reg))
>>>>>                           return;
>>>>>
>>>>>                   RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>>           }
>>>>>
>>>>>           static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>           {
>>>>>                   unsigned long flags;
>>>>>
>>>>>                   raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>                   __r8168_mac_ocp_write(tp, reg, data);
>>>>>                   raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>           }
>>>>>
>>>>> Register programming is done through RTL_W32() macro which expands into
>>>>>
>>>>>           #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>>
>>>>> which is further (on Alpha):
>>>>>
>>>>>           extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>>           {
>>>>>                   mb();
>>>>>                   __raw_writel(b, addr);
>>>>>           }
>>>>>
>>>>> or on i386/x86_64:
>>>>>
>>>>>       #define build_mmio_write(name, size, type, reg, barrier) \
>>>>>       static inline void name(type val, volatile void __iomem *addr) \
>>>>>       { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>>       "m" (*(volatile type __force *)addr) barrier); }
>>>>>
>>>>>       build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>>
>>>>> This obviously involves iat least a compiler barrier.
>>>>>
>>>>> mb() expands into something like this i.e. on x86_64:
>>>>>
>>>>>           #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>>
>>>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>>
>>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>>>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>>>>
>>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>>> number of bus stalls in a multicore or multi-CPU system:
>>>>>
>>>>>           static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>                                                 const struct recover_8411b_info *array)
>>>>>           {
>>>>>                   struct recover_8411b_info const *p = array;
>>>>>
>>>>>                   while (p->reg) {
>>>>>                           if (!rtl_ocp_reg_failure(p->reg))
>>>>>                                   RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>>>>>                           p++;
>>>>>                   }
>>>>>           }
>>>>>
>>>>>           static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>                                               const struct recover_8411b_info *array)
>>>>>           {
>>>>>                   unsigned long flags;
>>>>>
>>>>>                   raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>                   __r8168_mac_ocp_write_seq(tp, array);
>>>>>                   raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>           }
>>>>>
>>>>>           static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>>           {
>>>>>
>>>>>                   ...
>>>>>
>>>>>                   /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>>                    * getting confused after the PHY having been powered-down.
>>>>>                    */
>>>>>
>>>>>                   static const struct recover_8411b_info init_zero_seq[] = {
>>>>>                           { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   static const struct recover_8411b_info recover_seq[] = {
>>>>>                           { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   static const struct recover_8411b_info final_seq[] = {
>>>>>                           { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>>              ...
>>>>>                   };
>>>>>
>>>>>                   r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>>                   mdelay(3);
>>>>>                   r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>>                   r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>>                   r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>>                   r8168_mac_ocp_write_seq(tp, final_seq);
>>>>>           }
>>>>>
>>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>>> functions that only changed the function names and the ending of the line, so the actual
>>>>> hex data is unchanged.
>>>>>
>>>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>>
>>>> I still have a problem with this statement as you're saying that the original
>>>> problem still exists. I don't think that's the case.
>>>
>>> I will not disagree about it.
>>>
>>> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
>>> pairs.
>>>
>>> Maybe additionally, on the low level, memory barrier isn't required for each write to
>>> MMIO?
>>>
>> One could argue whether in several places writel_relaxed() could be used.
>> But it's not really worth it, because we're not in a hot path.
>
> I see. Thank you for your evaluation.
>
> Using writel_relaxed() sounds clever. It expands to:
>
>     #define build_mmio_write(name, size, type, reg, barrier) \
>         static inline void name(type val, volatile void __iomem *addr) \
>          { asm volatile("mov" size " %0,%1": :reg (val), \
>                         "m" (*(volatile type __force *)addr) barrier); }
>     build_mmio_write(__writel, "l", unsigned int, "r", )
>     #define writel_relaxed(v, a) __writel(v, a)
>
> Here "barrier" is an empty string. Really clever. ;-)
>
> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
> in each single one of the drivers could amount to some degrading of overall performance and
> latency in a multicore system.
>
> As I understood Mr. Jonathan Corbet on LWN, the initiative and trend is to reduce overall
> kernel latency.
>
Sure, but there's better places to look at first than a rarely used path in a driver.
See e.g. the history of threadad NAPI (and in general the RT Linux developments) or
the optimizations deep in the network stack Eric is working on.

> Thanks,
> Mirsad
>
>>> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
>>> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
>>> authoritative ...
>>>
>>>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>>>> the programming sequence to reach RTL NIC registers.
>>>>>
>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>>>
>>>>> Fixes: fe4e8db0392a6 ("r8169: fix issue with confused RX unit after PHY power-down on RTL8411b")
>>>>> Cc: Heiner Kallweit <[email protected]>
>>>>> Cc: Marco Elver <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: "David S. Miller" <[email protected]>
>>>>> Cc: Eric Dumazet <[email protected]>
>>>>> Cc: Jakub Kicinski <[email protected]>
>>>>> Cc: Paolo Abeni <[email protected]>
>>>>> Cc: [email protected]
>>>>> Cc: [email protected]
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>>>>> ---
>>>>> v3:
>>>>>    removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>>>    avoided duplication of RTL_W32() call code as advised by Heiner.
>>>>>
>>>>>    drivers/net/ethernet/realtek/r8169_main.c | 198 ++++++++--------------
>>>>>    1 file changed, 72 insertions(+), 126 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>>>> index 361b90007148..3b28bec7098b 100644
>>>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>>>> @@ -939,6 +939,32 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
>>>>>        raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>    }
>>>>>    +struct e_info_regmask_pair {
>>>>> +    u32    reg;
>>>>> +    u32    data;
>>>>> +};
>>>>> +
>>>>> +static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>> +                      const struct e_info_regmask_pair *array, int len)
>>>>> +{
>>>>> +    struct e_info_regmask_pair const *p;
>>>>> +
>>>>> +    for (p = array; len--; p++)
>>>>> +        __r8168_mac_ocp_write(tp, p->reg, p->data);
>>>>> +}
>>>>> +
>>>>> +static void _r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>> +                     const struct e_info_regmask_pair *array, int len)
>>>>> +{
>>>>> +    unsigned long flags;
>>>>> +
>>>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>> +    __r8168_mac_ocp_write_seq(tp, array, len);
>>>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>> +}
>>>>> +
>>>>> +#define r8168_mac_ocp_write_seq(tp, a) _r8168_mac_ocp_write_seq(tp, a, ARRAY_SIZE(a))
>>>>> +
>>>>>    /* Work around a hw issue with RTL8168g PHY, the quirk disables
>>>>>     * PHY MCU interrupts before PHY power-down.
>>>>>     */
>>>>> @@ -3107,138 +3133,58 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>>        /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>>         * getting confused after the PHY having been powered-down.
>>>>>         */
>>>>> -    r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
>>>>> +
>>>>> +    static const struct e_info_regmask_pair init_zero_seq[] = {
>>>>> +        { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>> +        { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
>>>>> +    };
>>>>> +
>>>> Don't mix code and variable declarations. Did you run checkpatch?
>>>> I think it would complain here.
>>>
>>> Thank you for the warning, I will fix it.
>>>
>>> As I said to Mr. Greg, I will do the required number of iterations to fix this issue.
>>>
>>> I will add checkpatch to my routine handling of my submitted patches.
>>>
>>> Thanks,
>>> Mirsad
>>>
>>>>> +    static const struct e_info_regmask_pair recover_seq[] = {
>>>>> +        { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>> +        { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
>>>>> +        { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
>>>>> +        { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
>>>>> +        { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
>>>>> +        { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
>>>>> +        { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
>>>>> +        { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
>>>>> +        { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
>>>>> +        { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
>>>>> +        { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
>>>>> +        { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
>>>>> +        { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
>>>>> +        { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
>>>>> +        { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
>>>>> +        { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
>>>>> +        { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
>>>>> +        { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
>>>>> +        { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
>>>>> +        { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
>>>>> +        { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
>>>>> +        { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
>>>>> +        { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
>>>>> +        { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
>>>>> +        { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
>>>>> +        { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
>>>>> +        { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
>>>>> +        { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
>>>>> +    };
>>>>> +
>>>>> +    static const struct e_info_regmask_pair final_seq[] = {
>>>>> +        { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>> +        { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
>>>>> +    };
>>>>> +
>>>>> +    r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>>        mdelay(3);
>>>>>        r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>>    -    r8168_mac_ocp_write(tp, 0xF800, 0xE008);
>>>>> -    r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF808, 0xE027);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
>>>>> -    r8168_mac_ocp_write(tp, 0xF810, 0xC602);
>>>>> -    r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF814, 0x0000);
>>>>> -    r8168_mac_ocp_write(tp, 0xF816, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
>>>>> -    r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF820, 0x080A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF822, 0x6420);
>>>>> -    r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
>>>>> -    r8168_mac_ocp_write(tp, 0xF828, 0xC516);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
>>>>> -    r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
>>>>> -    r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF846, 0xC404);
>>>>> -    r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
>>>>> -    r8168_mac_ocp_write(tp, 0xF852, 0xE434);
>>>>> -    r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
>>>>> -    r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
>>>>> -    r8168_mac_ocp_write(tp, 0xF860, 0xF007);
>>>>> -    r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
>>>>> -    r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
>>>>> -    r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
>>>>> -    r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
>>>>> -    r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
>>>>> -    r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF876, 0xC516);
>>>>> -    r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
>>>>> -    r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF880, 0xC512);
>>>>> -    r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF888, 0x483F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
>>>>> -    r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
>>>>> -    r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF892, 0xC505);
>>>>> -    r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF896, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
>>>>> -    r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
>>>>> -    r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
>>>>> +    r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>>          r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>>    -    r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
>>>>> -    r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
>>>>> +    r8168_mac_ocp_write_seq(tp, final_seq);
>>>>> +
>>>>>    }
>>>>>      static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)

2023-10-30 14:06:41

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On Mon, 30 Oct 2023 at 14:53, Mirsad Todorovac
<[email protected]> wrote:
>
> On 10/30/23 14:30, Mirsad Todorovac wrote:
> >
> >
> > On 10/30/23 14:17, Heiner Kallweit wrote:
> >> On 29.10.2023 05:56, Mirsad Todorovac wrote:
> >>>
> >>>
> >>> On 10/28/23 21:21, Heiner Kallweit wrote:
> >>>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
> >>>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
> >>>>> This was reported in [0] and confirmed by Realtek. Realtek provided
> >>>>> a sequence to fix the RX unit after PHY wakeup.
> >>>>>
> >>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
> >>>>> program the RTL registers for recovery.
> >>>>>
> >>>>> r8168_mac_ocp_write() expands to this code:
> >>>>>
> >>>>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
> >>>>> {
> >>>>> if (rtl_ocp_reg_failure(reg))
> >>>>> return;
> >>>>>
> >>>>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
> >>>>> }
> >>>>>
> >>>>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
> >>>>> {
> >>>>> unsigned long flags;
> >>>>>
> >>>>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> >>>>> __r8168_mac_ocp_write(tp, reg, data);
> >>>>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> >>>>> }
> >>>>>
> >>>>> Register programming is done through RTL_W32() macro which expands into
> >>>>>
> >>>>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
> >>>>>
> >>>>> which is further (on Alpha):
> >>>>>
> >>>>> extern inline void writel(u32 b, volatile void __iomem *addr)
> >>>>> {
> >>>>> mb();
> >>>>> __raw_writel(b, addr);
> >>>>> }
> >>>>>
> >>>>> or on i386/x86_64:
> >>>>>
> >>>>> #define build_mmio_write(name, size, type, reg, barrier) \
> >>>>> static inline void name(type val, volatile void __iomem *addr) \
> >>>>> { asm volatile("mov" size " %0,%1": :reg (val), \
> >>>>> "m" (*(volatile type __force *)addr) barrier); }
> >>>>>
> >>>>> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
> >>>>>
> >>>>> This obviously involves iat least a compiler barrier.
> >>>>>
> >>>>> mb() expands into something like this i.e. on x86_64:
> >>>>>
> >>>>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
> >>>>>
> >>>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
> >>>>> memory barrier, writel(), and spin_unlock_irqrestore().
> >>>>>
> >>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> >>>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
> >>>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
> >>>>>
> >>>>> In a sequential case of RTL register programming, the writes to RTL registers
> >>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
> >>>>> number of bus stalls in a multicore or multi-CPU system:
> >>>>>
> >>>>> static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> >>>>> const struct recover_8411b_info *array)
> >>>>> {
> >>>>> struct recover_8411b_info const *p = array;
> >>>>>
> >>>>> while (p->reg) {
> >>>>> if (!rtl_ocp_reg_failure(p->reg))
> >>>>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
> >>>>> p++;
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
> >>>>> const struct recover_8411b_info *array)
> >>>>> {
> >>>>> unsigned long flags;
> >>>>>
> >>>>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> >>>>> __r8168_mac_ocp_write_seq(tp, array);
> >>>>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> >>>>> }
> >>>>>
> >>>>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
> >>>>> {
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> /* The following Realtek-provided magic fixes an issue with the RX unit
> >>>>> * getting confused after the PHY having been powered-down.
> >>>>> */
> >>>>>
> >>>>> static const struct recover_8411b_info init_zero_seq[] = {
> >>>>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
> >>>>> ...
> >>>>> };
> >>>>>
> >>>>> static const struct recover_8411b_info recover_seq[] = {
> >>>>> { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
> >>>>> ...
> >>>>> };
> >>>>>
> >>>>> static const struct recover_8411b_info final_seq[] = {
> >>>>> { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
> >>>>> ...
> >>>>> };
> >>>>>
> >>>>> r8168_mac_ocp_write_seq(tp, init_zero_seq);
> >>>>> mdelay(3);
> >>>>> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
> >>>>> r8168_mac_ocp_write_seq(tp, recover_seq);
> >>>>> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
> >>>>> r8168_mac_ocp_write_seq(tp, final_seq);
> >>>>> }
> >>>>>
> >>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
> >>>>> functions that only changed the function names and the ending of the line, so the actual
> >>>>> hex data is unchanged.
> >>>>>
> >>>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
> >>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
> >>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
> >>>>
> >>>> I still have a problem with this statement as you're saying that the original
> >>>> problem still exists. I don't think that's the case.
> >>>
> >>> I will not disagree about it.
> >>>
> >>> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
> >>> pairs.
> >>>
> >>> Maybe additionally, on the low level, memory barrier isn't required for each write to
> >>> MMIO?
> >>>
> >> One could argue whether in several places writel_relaxed() could be used.
> >> But it's not really worth it, because we're not in a hot path.
> >
> > I see. Thank you for your evaluation.
> >
> > Using writel_relaxed() sounds clever. It expands to:
> >
> > #define build_mmio_write(name, size, type, reg, barrier) \
> > static inline void name(type val, volatile void __iomem *addr) \
> > { asm volatile("mov" size " %0,%1": :reg (val), \
> > "m" (*(volatile type __force *)addr) barrier); }
> > build_mmio_write(__writel, "l", unsigned int, "r", )
> > #define writel_relaxed(v, a) __writel(v, a)
> >
> > Here "barrier" is an empty string. Really clever. ;-)
> >
> > I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
> > in each single one of the drivers could amount to some degrading of overall performance and
> > latency in a multicore system.
> >
> > As I understood Mr. Jonathan Corbet on LWN, the initiative and trend is to reduce overall
> > kernel latency.
>
> P.S.
>
> On the second thought, if barrier() is only the compiler optimisation barrier from memory
> reordering, then we do not gain much disablin git as it doesn't affect the other cores, and
> reordering MMIO writes can really confuse some NIC hardware.
>
> /* Optimization barrier */
> #ifndef barrier
> /* The "volatile" is due to gcc bugs */
> # define barrier() __asm__ __volatile__("": : :"memory")
> #endif
>
> >>> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
> >>> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
> >>> authoritative ...

I would recommend looking at the pre-processed source code if you
can't manually untangle the maze of macros. ;-)

Look at the .${obj}.cmd file (e.g.
drivers/net/ethernet/realtek/.r8169_main.o.cmd), copy the compiler
command in it, and add the "-E" option and make the compiler write the
result to some temporary file you can inspect.

2023-10-30 15:13:51

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 10/30/23 15:05, Marco Elver wrote:
> On Mon, 30 Oct 2023 at 14:53, Mirsad Todorovac
> <[email protected]> wrote:
>>
>> On 10/30/23 14:30, Mirsad Todorovac wrote:
>>>
>>>
>>> On 10/30/23 14:17, Heiner Kallweit wrote:
>>>> On 29.10.2023 05:56, Mirsad Todorovac wrote:
>>>>>
>>>>>
>>>>> On 10/28/23 21:21, Heiner Kallweit wrote:
>>>>>> On 28.10.2023 13:05, Mirsad Goran Todorovac wrote:
>>>>>>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>>>>> This was reported in [0] and confirmed by Realtek. Realtek provided
>>>>>>> a sequence to fix the RX unit after PHY wakeup.
>>>>>>>
>>>>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to
>>>>>>> program the RTL registers for recovery.
>>>>>>>
>>>>>>> r8168_mac_ocp_write() expands to this code:
>>>>>>>
>>>>>>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>>> {
>>>>>>> if (rtl_ocp_reg_failure(reg))
>>>>>>> return;
>>>>>>>
>>>>>>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>>>> }
>>>>>>>
>>>>>>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
>>>>>>> {
>>>>>>> unsigned long flags;
>>>>>>>
>>>>>>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>>> __r8168_mac_ocp_write(tp, reg, data);
>>>>>>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>>> }
>>>>>>>
>>>>>>> Register programming is done through RTL_W32() macro which expands into
>>>>>>>
>>>>>>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>>>>
>>>>>>> which is further (on Alpha):
>>>>>>>
>>>>>>> extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>>>> {
>>>>>>> mb();
>>>>>>> __raw_writel(b, addr);
>>>>>>> }
>>>>>>>
>>>>>>> or on i386/x86_64:
>>>>>>>
>>>>>>> #define build_mmio_write(name, size, type, reg, barrier) \
>>>>>>> static inline void name(type val, volatile void __iomem *addr) \
>>>>>>> { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>>>> "m" (*(volatile type __force *)addr) barrier); }
>>>>>>>
>>>>>>> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>>>>
>>>>>>> This obviously involves iat least a compiler barrier.
>>>>>>>
>>>>>>> mb() expands into something like this i.e. on x86_64:
>>>>>>>
>>>>>>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>>>>
>>>>>>> This means a whole lot of memory bus barriers: for spin_lock_irqsave(),
>>>>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>>>>
>>>>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>>>>> a LOCK storm that will thunder all of the cores and CPUs on the same memory controller
>>>>>>> for certain time that locked memory read-modify-write cyclo or I/O takes to finish.
>>>>>>>
>>>>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>>>>> number of bus stalls in a multicore or multi-CPU system:
>>>>>>>
>>>>>>> static void __r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>>> const struct recover_8411b_info *array)
>>>>>>> {
>>>>>>> struct recover_8411b_info const *p = array;
>>>>>>>
>>>>>>> while (p->reg) {
>>>>>>> if (!rtl_ocp_reg_failure(p->reg))
>>>>>>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (p->reg << 15) | p->data);
>>>>>>> p++;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> static void r8168_mac_ocp_write_seq(struct rtl8169_private *tp,
>>>>>>> const struct recover_8411b_info *array)
>>>>>>> {
>>>>>>> unsigned long flags;
>>>>>>>
>>>>>>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>>>> __r8168_mac_ocp_write_seq(tp, array);
>>>>>>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>>>> }
>>>>>>>
>>>>>>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>>>> {
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>>>> * getting confused after the PHY having been powered-down.
>>>>>>> */
>>>>>>>
>>>>>>> static const struct recover_8411b_info init_zero_seq[] = {
>>>>>>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
>>>>>>> ...
>>>>>>> };
>>>>>>>
>>>>>>> static const struct recover_8411b_info recover_seq[] = {
>>>>>>> { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
>>>>>>> ...
>>>>>>> };
>>>>>>>
>>>>>>> static const struct recover_8411b_info final_seq[] = {
>>>>>>> { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
>>>>>>> ...
>>>>>>> };
>>>>>>>
>>>>>>> r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>>>> mdelay(3);
>>>>>>> r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
>>>>>>> r8168_mac_ocp_write_seq(tp, recover_seq);
>>>>>>> r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
>>>>>>> r8168_mac_ocp_write_seq(tp, final_seq);
>>>>>>> }
>>>>>>>
>>>>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>>>>> functions that only changed the function names and the ending of the line, so the actual
>>>>>>> hex data is unchanged.
>>>>>>>
>>>>>>> Note that the original reason for the introduction of the commit fe4e8db0392a6
>>>>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>>>>
>>>>>> I still have a problem with this statement as you're saying that the original
>>>>>> problem still exists. I don't think that's the case.
>>>>>
>>>>> I will not disagree about it.
>>>>>
>>>>> But we have only reduced the number of spin_lock_irqsave/spin_unlock_irqrestore()
>>>>> pairs.
>>>>>
>>>>> Maybe additionally, on the low level, memory barrier isn't required for each write to
>>>>> MMIO?
>>>>>
>>>> One could argue whether in several places writel_relaxed() could be used.
>>>> But it's not really worth it, because we're not in a hot path.
>>>
>>> I see. Thank you for your evaluation.
>>>
>>> Using writel_relaxed() sounds clever. It expands to:
>>>
>>> #define build_mmio_write(name, size, type, reg, barrier) \
>>> static inline void name(type val, volatile void __iomem *addr) \
>>> { asm volatile("mov" size " %0,%1": :reg (val), \
>>> "m" (*(volatile type __force *)addr) barrier); }
>>> build_mmio_write(__writel, "l", unsigned int, "r", )
>>> #define writel_relaxed(v, a) __writel(v, a)
>>>
>>> Here "barrier" is an empty string. Really clever. ;-)
>>>
>>> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
>>> in each single one of the drivers could amount to some degrading of overall performance and
>>> latency in a multicore system.
>>>
>>> As I understood Mr. Jonathan Corbet on LWN, the initiative and trend is to reduce overall
>>> kernel latency.
>>
>> P.S.
>>
>> On the second thought, if barrier() is only the compiler optimisation barrier from memory
>> reordering, then we do not gain much disablin git as it doesn't affect the other cores, and
>> reordering MMIO writes can really confuse some NIC hardware.
>>
>> /* Optimization barrier */
>> #ifndef barrier
>> /* The "volatile" is due to gcc bugs */
>> # define barrier() __asm__ __volatile__("": : :"memory")
>> #endif
>>
>>>>> If it still uses a LOCK addl $0, m32/m64, then it still creates 130 instances of all core
>>>>> bus locks for this NIC reset after the lost PHY? I'm just thinking, this is nothing
>>>>> authoritative ...
>
> I would recommend looking at the pre-processed source code if you
> can't manually untangle the maze of macros. ;-)
>
> Look at the .${obj}.cmd file (e.g.
> drivers/net/ethernet/realtek/.r8169_main.o.cmd), copy the compiler
> command in it, and add the "-E" option and make the compiler write the
> result to some temporary file you can inspect.

Thanks for the tip. I tried to add

EXTRA_CFLAGS += -save-temps

to the driver/net/ethernet/realtek/Kbuild, but something went wrong. Now I see that it is deprecated.

Best regards,
Mirsad Todorovac

2023-10-31 01:21:55

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
> in each single one of the drivers could amount to some degrading of overall performance and
> latency in a multicore system.

For optimisations, we like to see benchmark results which show some
improvements. Do you have any numbers?

Andrew

2023-10-31 03:35:39

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On 10/31/23 02:21, Andrew Lunn wrote:
>> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
>> in each single one of the drivers could amount to some degrading of overall performance and
>> latency in a multicore system.
>
> For optimisations, we like to see benchmark results which show some
> improvements. Do you have any numbers?

Hi, Andrew,

Thank you for your interest in RTL NIC driver optimisations.

My knowledge about the timing costs of synchronisation is mostly theoretical.

By the table below, the cost of LOCK CHPXCHG m,616/32/64 is 52 cycles (in case the
memory location is already in the L3 cache I suppose, cache miss can makes things worse,
but we are hitting the same lock repeatedly).

So, we have (in the best case) 52 clocks for lock, 52 for unlock (for the uncontentended
case).

This means 104 cycle x 130 sequential lock+unlock pairs = 13520 clk ~ 3.38 us on a 4 GHz
machine (5.633 us on a 2.4 GHz CPU) while the multicore system does nothing but locking
and unclocking after therecovery from the loss of PHY.

We are talking about the RTl8411b.

The other case it the rather new 8125/8125b.

static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
{
rtl_pcie_state_l2l3_disable(tp);

RTL_W16(tp, 0x382, 0x221b);
RTL_W8(tp, 0x4500, 0);
RTL_W16(tp, 0x4800, 0);

/* disable UPS */
r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);

RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);

r8168_mac_ocp_write(tp, 0xc140, 0xffff);
r8168_mac_ocp_write(tp, 0xc142, 0xffff);

r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);

/* disable new tx descriptor format */
r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
else
r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
else
r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);

r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);

r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
udelay(1);
r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);

r8168_mac_ocp_write(tp, 0xe098, 0xc302);

rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
rtl8125b_config_eee_mac(tp);
else
rtl8125a_config_eee_mac(tp);

rtl_disable_rxdvgate(tp);
}

There are 22 calls to mac ocp write/modify, each having a raw_spin_lock_irqsave()/
raw_spin_unlock_irqrestore() pair.

The minimal timing cost is 22x104 = 2288 cycles or a total of 0.572 us on a 4 GHz CPU,
preventing all cores from the memory access. (0.953 us on a 2.4 GHz system).

This does happen only at the startup apparently, but also at each rtl_reset_work(tp)
and rtl8169_up(tp), rtl_open() at driver load, and rtl8169_runtime_resume() called by
the rtl8169_resume().

The revised version does only 2 pairs of raw_spin_lock_irqsave()/
raw_sping_unlock_irqrestore(), but I admit it might be harder to read and maintain.

My knowledge of the Linux kernel network stack isn't that deep, and I can't tell how
often rtl8169_resume() is called in real life and how to reproduce and benchmark that
call.

This is a sort of a "loophole optimisation", I don't have the numbers on the impact
on the entire Linux on i.e. 16 core/32 threads Ryzen 9 7950X CPU.

Probably things would be worse on a 64 core CPU, because the 5.633 us and 0.953 us
would have to be multiplied by the number of blocked cores, amounting to estimated
total CPU cost of 360 us and 61 us respectively. Per NIC reset.

However, I am relatively new to the r8169 driver, I came across while analysing
KCSAN reports since the beginning of August. So ATM I don't have any benchmarks
to confirm the theoretical findings.

I realise that maintainers have to cherry pick patches or the kernel of +35M lines
would become way too cluttered and unmaintainable.

NOTE: The numbers are only valid for a x86_64 system.

Regards,
Mirsad

SOURCE AND REFERENCES:

Synchronization (Source: [1])
===============================================
instruction operands ops latency
===============================================
LOCK ADD m,r 1 ~55
XADD m,r 4 10
LOCK XADD m,r 4 ~51
CMPXCHG m8,r8 5 15
LOCK CMPXCHG m8,r8 5 ~51
CMPXCHG m,r16/32/64 6 14
LOCK CMPXCHG m,r16/32/64 6 ~52
CMPXCHG8B m64 18 15
LOCK CMPXCHG8B m64 18 ~53
CMPXCHG16B m128 22 52
LOCK CMPXCHG16B m128 22 ~94
===============================================

Explanation of column headings:
Instruction:
Instruction name. cc means any condition code. For example, Jcc can be JB, JNE,
etc.

Operands:
i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit
mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory
operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops:
Number of macro-operations issued from instruction decoder to schedulers. In-
structions with more than 2 macro-operations use microcode.
Latency:
This is the delay that the instruction generates in a dependency chain. The num-
bers are minimum values. Cache misses, misalignment, and exceptions may in-
crease the clock counts considerably. Floating point operands are presumed to be
normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the
delays. The latency listed does not include the memory operand where the listing
for register and memory operand are joined (r/m).

[1] TL;DR https://www.agner.org/optimize/instruction_tables.pdf
[2] TL;DR http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

2023-10-31 12:38:50

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

On Tue, Oct 31, 2023 at 04:35:19AM +0100, Mirsad Todorovac wrote:
> On 10/31/23 02:21, Andrew Lunn wrote:
> > > I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
> > > in each single one of the drivers could amount to some degrading of overall performance and
> > > latency in a multicore system.
> >
> > For optimisations, we like to see benchmark results which show some
> > improvements. Do you have any numbers?
>
> Hi, Andrew,
>
> Thank you for your interest in RTL NIC driver optimisations.
>
> My knowledge about the timing costs of synchronisation is mostly theoretical.

The kernel tends to be very practical. Maybe try to turn the
theoretical knowledge into practice. Write a benchmark test, or see if
any of the existing RT Linux tests show there is a real problem here,
and your changes fix it.

Andrew