2023-11-04 22:18:14

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

The motivation for these helpers was the locking overhead of 130 consecutive
r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
if the PHY is powered-down.

To quote Heiner:

On RTL8411b the RX unit gets confused if the PHY is powered-down.
This was reported in [0] and confirmed by Realtek. Realtek provided
a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to program the
RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
spin_unlock_irqrestore().

Each mac ocp write is made of:

static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
u32 data)
{
if (rtl_ocp_reg_failure(reg))
return;

RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
}

static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
u32 data)
{
unsigned long flags;

raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
__r8168_mac_ocp_write(tp, reg, data);
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
}

Register programming is done through RTL_W32() macro which expands into

#define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

which is further (on Alpha):

extern inline void writel(u32 b, volatile void __iomem *addr)
{
mb();
__raw_writel(b, addr);
}

or on i386/x86_64:

#define build_mmio_write(name, size, type, reg, barrier) \
static inline void name(type val, volatile void __iomem *addr) \
{ asm volatile("mov" size " %0,%1": :reg (val), \
"m" (*(volatile type __force *)addr) barrier); }

build_mmio_write(writel, "l", unsigned int, "r", :"memory")

This obviously involves iat least a compiler barrier.

mb() expands into something like this i.e. on x86_64:

#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
memory barrier, writel(), and spin_unlock_irqrestore().

With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
a lock storm that will stall all of the cores and CPUs on the same memory controller
for certain time I/O takes to finish.

In a sequential case of RTL register programming, the writes to RTL registers
can be coalesced under a same raw spinlock. This can dramatically decrease the
number of bus stalls in a multicore or multi-CPU system.

Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
provided to reduce lock contention:

static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
{

...

/* The following Realtek-provided magic fixes an issue with the RX unit
* getting confused after the PHY having been powered-down.
*/

static const struct recover_8411b_info init_zero_seq[] = {
{ 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
...
};

...

r8168_mac_ocp_write_seq(tp, init_zero_seq);

...

}

The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
functions that only changed the function names and the ending of the line, so the actual
hex data is unchanged.

To repeat, the reason for the introduction of the original commit
was to enable recovery of the RX unit on the RTL8411b which was confused by the
powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
into a series of about 500+ memory bus locks, most waiting for the main memory read,
modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
the programming sequence to reach RTL NIC registers.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075

v6:
proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
attempted some new optimisations, which were rejected, but not all and not completely.

v4:
fixed complaints as advised by Heiner and checkpatch.pl.
split the patch into five sections to be more easily manipulated and reviewed
introduced r8168_mac_ocp_write_seq()
applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
removed register/mask pair array sentinels, so using ARRAY_SIZE().
avoided duplication of RTL_W32() call code as advised by Heiner.

Mirsad Goran Todorovac (5):
r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
stalls
r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
spinlock stalls
r8169: Coalesce mac ocp write and modify for 8168H start to reduce
spinlocks
r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
spinlock contention
r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
spinlocks

drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
1 file changed, 150 insertions(+), 154 deletions(-)

--
2.34.1


2023-11-04 22:18:31

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH net-next v6 4/5] r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce spinlock contention

Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
spin_unlock_irqrestore() on each invocation.

Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle lock/unlock,
these calls reduce overall lock contention.

Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
Cc: Heiner Kallweit <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
---
drivers/net/ethernet/realtek/r8169_main.c | 38 ++++++++++++++---------
1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 0fb34d217205..056fe5b3930b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3539,6 +3539,27 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)

static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
{
+
+ static const struct e_info_regmaskset e_info_8125_common_1[] = {
+ { 0xd3e2, 0x0fff, 0x03a9 },
+ { 0xd3e4, 0x00ff, 0x0000 },
+ { 0xe860, 0x0000, 0x0080 },
+ };
+
+ static const struct e_info_regmaskset e_info_8125_common_2[] = {
+ { 0xc0b4, 0x0000, 0x000c },
+ { 0xeb6a, 0x00ff, 0x0033 },
+ { 0xeb50, 0x03e0, 0x0040 },
+ { 0xe056, 0x00f0, 0x0030 },
+ { 0xe040, 0x1000, 0x0000 },
+ { 0xea1c, 0x0003, 0x0001 },
+ { 0xe0c0, 0x4f0f, 0x4403 },
+ { 0xe052, 0x0080, 0x0068 },
+ { 0xd430, 0x0fff, 0x047f },
+ { 0xea1c, 0x0004, 0x0000 },
+ { 0xeb54, 0x0000, 0x0001 },
+ };
+
rtl_pcie_state_l2l3_disable(tp);

RTL_W16(tp, 0x382, 0x221b);
@@ -3553,9 +3574,7 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
r8168_mac_ocp_write(tp, 0xc140, 0xffff);
r8168_mac_ocp_write(tp, 0xc142, 0xffff);

- r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
- r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
- r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
+ r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);

/* disable new tx descriptor format */
r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
@@ -3570,18 +3589,7 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
else
r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);

- r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
- r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
- r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
- r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
- r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
- r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
- r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
- r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
- r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
-
- r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
- r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
+ r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
udelay(1);
r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
--
2.34.1

2023-11-04 22:18:38

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH net-next v6 5/5] r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce spinlocks

Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the init sequence of the 8125 involve implicit spin_lock_irqsave() and
spin_unlock_irqrestore() on each invocation.

Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle lock/unlock,
these calls reduce overall lock contention.

Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
Cc: Heiner Kallweit <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
---
drivers/net/ethernet/realtek/r8169_main.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 056fe5b3930b..42f0a7486151 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5074,6 +5074,12 @@ static void rtl_hw_init_8168g(struct rtl8169_private *tp)

static void rtl_hw_init_8125(struct rtl8169_private *tp)
{
+ static const struct e_info_regdata hw_init_8125_1[] = {
+ { 0xc0aa, 0x07d0 },
+ { 0xc0a6, 0x0150 },
+ { 0xc01e, 0x5555 },
+ };
+
rtl_enable_rxdvgate(tp);

RTL_W8(tp, ChipCmd, RTL_R8(tp, ChipCmd) & ~(CmdTxEnb | CmdRxEnb));
@@ -5083,9 +5089,7 @@ static void rtl_hw_init_8125(struct rtl8169_private *tp)
r8168_mac_ocp_modify(tp, 0xe8de, BIT(14), 0);
r8168g_wait_ll_share_fifo_ready(tp);

- r8168_mac_ocp_write(tp, 0xc0aa, 0x07d0);
- r8168_mac_ocp_write(tp, 0xc0a6, 0x0150);
- r8168_mac_ocp_write(tp, 0xc01e, 0x5555);
+ r8168_mac_ocp_write_seq(tp, hw_init_8125_1);
r8168g_wait_ll_share_fifo_ready(tp);
}

--
2.34.1

2023-11-04 22:18:42

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH net-next v6 1/5] r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock stalls

A pair of new helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq()
are introduced.

They are meant to minimise the locking and unlocking overhead when just assuring
the sequential mac ocp register programming according to the Realtek specs would do.
The latter is assured by the compiler optimisation "barrier" in the writev() call
called by the low-level RTL_W32() primitive.

Cc: Heiner Kallweit <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
---
v6:
proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
attempted some new optimisations, which were rejected, but not all and not completely.

v4:
fixed complaints as advised by Heiner and checkpatch.pl.
split the patch into five sections to be more easily manipulated and reviewed
introduced r8168_mac_ocp_write_seq()
applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
removed register/mask pair array sentinels, so using ARRAY_SIZE().
avoided duplication of RTL_W32() call code as advised by Heiner.

drivers/net/ethernet/realtek/r8169_main.c | 57 +++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index a987defb575c..e39b5777d67b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -939,6 +939,63 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
}

+struct e_info_regdata {
+ u32 reg;
+ u32 data;
+};
+
+struct e_info_regmaskset {
+ u32 reg;
+ u16 mask;
+ u16 set;
+};
+
+static void __r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp,
+ const struct e_info_regdata *array, int len)
+{
+ struct e_info_regdata const *p;
+
+ for (p = array; len--; p++)
+ __r8168_mac_ocp_write(tp, p->reg, p->data);
+}
+
+static void r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp,
+ const struct e_info_regdata *array, int len)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+ __r8168_mac_ocp_write_seqlen(tp, array, len);
+ raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
+}
+
+static void __r8168_mac_ocp_modify_seqlen(struct rtl8169_private *tp,
+ const struct e_info_regmaskset *array, int len)
+{
+ struct e_info_regmaskset const *p;
+ u16 data;
+
+ for (p = array; len--; p++) {
+ data = __r8168_mac_ocp_read(tp, p->reg);
+ __r8168_mac_ocp_write(tp, p->reg, (data & ~p->mask) | p->set);
+ }
+}
+
+static void r8168_mac_ocp_modify_seqlen(struct rtl8169_private *tp,
+ const struct e_info_regmaskset *array, int len)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+ __r8168_mac_ocp_modify_seqlen(tp, array, len);
+ raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
+}
+
+#define r8168_mac_ocp_write_seq(tp, a) r8168_mac_ocp_write_seqlen(tp, a, ARRAY_SIZE(a))
+#define r8168_mac_ocp_modify_seq(tp, a) r8168_mac_ocp_modify_seqlen(tp, a, ARRAY_SIZE(a))
+#define __r8168_mac_ocp_write_seq(tp, a) __r8168_mac_ocp_write_seqlen(tp, a, ARRAY_SIZE(a))
+#define __r8168_mac_ocp_modify_seq(tp, a) __r8168_mac_ocp_modify_seqlen(tp, a, ARRAY_SIZE(a))
+
/* Work around a hw issue with RTL8168g PHY, the quirk disables
* PHY MCU interrupts before PHY power-down.
*/
--
2.34.1

2023-11-04 22:18:44

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH net-next v6 2/5] r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce spinlock stalls

On RTL8411b the RX unit gets confused if the PHY is powered-down.
This was reported in [0] and confirmed by Realtek. Realtek provided
a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to
program the RTL registers for recovery.

With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
a lock storm that will stall all of the cores sharing the same cache for certain
time I/O takes to finish.

In a sequential case of RTL register programming, a sequence of writes to the RTL
registers can be coalesced under a same raw spinlock. This can dramatically decrease
the number of bus stalls in a multicore or multi-CPU system.

Cc: Heiner Kallweit <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
---
v6:
proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
attempted some new optimisations, which were rejected, but not all and not completely.

v4:
fixed complaints as advised by Heiner and checkpatch.pl.
split the patch into five sections to be more easily manipulated and reviewed
introduced r8168_mac_ocp_write_seq()
applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
removed register/mask pair array sentinels, so using ARRAY_SIZE().
avoided duplication of RTL_W32() call code as advised by Heiner.

drivers/net/ethernet/realtek/r8169_main.c | 173 ++++++----------------
1 file changed, 46 insertions(+), 127 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index e39b5777d67b..5515c51b6e3c 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3157,145 +3157,64 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
{ 0x1d, 0x0000, 0x4000 },
};

- rtl_hw_start_8168g(tp);
+ static const struct e_info_regdata init_zero_seq[] = {
+ { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
+ { 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
+ };

+ static const struct e_info_regdata recover_seq[] = {
+ { 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
+ { 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
+ { 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
+ { 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
+ { 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
+ { 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
+ { 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
+ { 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
+ { 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
+ { 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
+ { 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
+ { 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
+ { 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
+ { 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
+ { 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
+ { 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
+ { 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
+ { 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
+ { 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
+ { 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
+ { 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
+ { 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
+ { 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
+ { 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
+ { 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
+ { 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
+ { 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
+ { 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
+ };
+
+ static const struct e_info_regdata final_seq[] = {
+ { 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
+ { 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
+ };
+
+ rtl_hw_start_8168g(tp);
rtl_ephy_init(tp, e_info_8411_2);

/* The following Realtek-provided magic fixes an issue with the RX unit
* getting confused after the PHY having been powered-down.
*/
- r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
- r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
+
+ r8168_mac_ocp_write_seq(tp, init_zero_seq);
mdelay(3);
r8168_mac_ocp_write(tp, 0xFC26, 0x0000);

- r8168_mac_ocp_write(tp, 0xF800, 0xE008);
- r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
- r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
- r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
- r8168_mac_ocp_write(tp, 0xF808, 0xE027);
- r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
- r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
- r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
- r8168_mac_ocp_write(tp, 0xF810, 0xC602);
- r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF814, 0x0000);
- r8168_mac_ocp_write(tp, 0xF816, 0xC502);
- r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
- r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
- r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
- r8168_mac_ocp_write(tp, 0xF820, 0x080A);
- r8168_mac_ocp_write(tp, 0xF822, 0x6420);
- r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
- r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
- r8168_mac_ocp_write(tp, 0xF828, 0xC516);
- r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
- r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
- r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
- r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
- r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
- r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
- r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
- r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
- r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
- r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
- r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
- r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
- r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
- r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
- r8168_mac_ocp_write(tp, 0xF846, 0xC404);
- r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
- r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
- r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
- r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
- r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
- r8168_mac_ocp_write(tp, 0xF852, 0xE434);
- r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
- r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
- r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
- r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
- r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
- r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
- r8168_mac_ocp_write(tp, 0xF860, 0xF007);
- r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
- r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
- r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
- r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
- r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
- r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
- r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
- r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
- r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
- r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
- r8168_mac_ocp_write(tp, 0xF876, 0xC516);
- r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
- r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
- r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
- r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
- r8168_mac_ocp_write(tp, 0xF880, 0xC512);
- r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
- r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
- r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
- r8168_mac_ocp_write(tp, 0xF888, 0x483F);
- r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
- r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
- r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
- r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
- r8168_mac_ocp_write(tp, 0xF892, 0xC505);
- r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF896, 0xC502);
- r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
- r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
- r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
- r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
- r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
- r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
- r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
- r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
- r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
- r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
- r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
- r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
- r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
- r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
- r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
- r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
- r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
- r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
- r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
- r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
- r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
- r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
- r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
- r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
- r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
- r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
- r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
- r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
- r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
- r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
- r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
+ r8168_mac_ocp_write_seq(tp, recover_seq);

r8168_mac_ocp_write(tp, 0xFC26, 0x8000);

- r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
- r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
- r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
- r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
- r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
- r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
- r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
+ r8168_mac_ocp_write_seq(tp, final_seq);
+
}

static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
--
2.34.1

2023-11-04 22:55:01

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
> The motivation for these helpers was the locking overhead of 130 consecutive
> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
> if the PHY is powered-down.
>
> To quote Heiner:
>
> On RTL8411b the RX unit gets confused if the PHY is powered-down.
> This was reported in [0] and confirmed by Realtek. Realtek provided
> a sequence to fix the RX unit after PHY wakeup.
>
> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
> spin_unlock_irqrestore().
>
> Each mac ocp write is made of:
>
> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
> u32 data)
> {
> if (rtl_ocp_reg_failure(reg))
> return;
>
> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
> }
>
> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
> u32 data)
> {
> unsigned long flags;
>
> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> __r8168_mac_ocp_write(tp, reg, data);
> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
> }
>
> Register programming is done through RTL_W32() macro which expands into
>
> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>
> which is further (on Alpha):
>
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> mb();
> __raw_writel(b, addr);
> }
>
> or on i386/x86_64:
>
> #define build_mmio_write(name, size, type, reg, barrier) \
> static inline void name(type val, volatile void __iomem *addr) \
> { asm volatile("mov" size " %0,%1": :reg (val), \
> "m" (*(volatile type __force *)addr) barrier); }
>
> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>
> This obviously involves iat least a compiler barrier.
>
> mb() expands into something like this i.e. on x86_64:
>
> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>
> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
> memory barrier, writel(), and spin_unlock_irqrestore().
>
> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> a lock storm that will stall all of the cores and CPUs on the same memory controller
> for certain time I/O takes to finish.
>
> In a sequential case of RTL register programming, the writes to RTL registers
> can be coalesced under a same raw spinlock. This can dramatically decrease the
> number of bus stalls in a multicore or multi-CPU system.
>
> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
> provided to reduce lock contention:
>
> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
> {
>
> ...
>
> /* The following Realtek-provided magic fixes an issue with the RX unit
> * getting confused after the PHY having been powered-down.
> */
>
> static const struct recover_8411b_info init_zero_seq[] = {
> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
> ...
> };
>
> ...
>
> r8168_mac_ocp_write_seq(tp, init_zero_seq);
>
> ...
>
> }
>
> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
> functions that only changed the function names and the ending of the line, so the actual
> hex data is unchanged.
>
> To repeat, the reason for the introduction of the original commit
> was to enable recovery of the RX unit on the RTL8411b which was confused by the
> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
> into a series of about 500+ memory bus locks, most waiting for the main memory read,
> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
> the programming sequence to reach RTL NIC registers.
>
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>
> v6:
> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>
> v5:
> attempted some new optimisations, which were rejected, but not all and not completely.
>
> v4:
> fixed complaints as advised by Heiner and checkpatch.pl.
> split the patch into five sections to be more easily manipulated and reviewed
> introduced r8168_mac_ocp_write_seq()
> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>
> v3:
> removed register/mask pair array sentinels, so using ARRAY_SIZE().
> avoided duplication of RTL_W32() call code as advised by Heiner.
>
> Mirsad Goran Todorovac (5):
> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
> stalls
> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
> spinlock stalls
> r8169: Coalesce mac ocp write and modify for 8168H start to reduce
> spinlocks
> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
> spinlock contention
> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
> spinlocks
>
> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
> 1 file changed, 150 insertions(+), 154 deletions(-)
>

You still write:
"a lock storm that will stall all of the cores and CPUs on the same memory controller"
even though you were informed that that's not the case.
There's no actual problem, therefore your Fixes tags are incorrect.
Also net-next is closed at the moment.
In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
patch 2 is worth adding all the helpers in patch 1.

2023-11-05 00:16:51

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention



On 11/4/23 23:37, Heiner Kallweit wrote:
> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>> The motivation for these helpers was the locking overhead of 130 consecutive
>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>> if the PHY is powered-down.
>>
>> To quote Heiner:
>>
>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>> This was reported in [0] and confirmed by Realtek. Realtek provided
>> a sequence to fix the RX unit after PHY wakeup.
>>
>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Each mac ocp write is made of:
>>
>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>> u32 data)
>> {
>> if (rtl_ocp_reg_failure(reg))
>> return;
>>
>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>> }
>>
>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>> u32 data)
>> {
>> unsigned long flags;
>>
>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> __r8168_mac_ocp_write(tp, reg, data);
>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> }
>>
>> Register programming is done through RTL_W32() macro which expands into
>>
>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>
>> which is further (on Alpha):
>>
>> extern inline void writel(u32 b, volatile void __iomem *addr)
>> {
>> mb();
>> __raw_writel(b, addr);
>> }
>>
>> or on i386/x86_64:
>>
>> #define build_mmio_write(name, size, type, reg, barrier) \
>> static inline void name(type val, volatile void __iomem *addr) \
>> { asm volatile("mov" size " %0,%1": :reg (val), \
>> "m" (*(volatile type __force *)addr) barrier); }
>>
>> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>
>> This obviously involves iat least a compiler barrier.
>>
>> mb() expands into something like this i.e. on x86_64:
>>
>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>
>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>> memory barrier, writel(), and spin_unlock_irqrestore().
>>
>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>> for certain time I/O takes to finish.
>>
>> In a sequential case of RTL register programming, the writes to RTL registers
>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>> number of bus stalls in a multicore or multi-CPU system.
>>
>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>> provided to reduce lock contention:
>>
>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>> {
>>
>> ...
>>
>> /* The following Realtek-provided magic fixes an issue with the RX unit
>> * getting confused after the PHY having been powered-down.
>> */
>>
>> static const struct recover_8411b_info init_zero_seq[] = {
>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>> ...
>> };
>>
>> ...
>>
>> r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>
>> ...
>>
>> }
>>
>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>> functions that only changed the function names and the ending of the line, so the actual
>> hex data is unchanged.
>>
>> To repeat, the reason for the introduction of the original commit
>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>> the programming sequence to reach RTL NIC registers.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>
>> v6:
>> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>
>> v5:
>> attempted some new optimisations, which were rejected, but not all and not completely.
>>
>> v4:
>> fixed complaints as advised by Heiner and checkpatch.pl.
>> split the patch into five sections to be more easily manipulated and reviewed
>> introduced r8168_mac_ocp_write_seq()
>> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>
>> v3:
>> removed register/mask pair array sentinels, so using ARRAY_SIZE().
>> avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>> Mirsad Goran Todorovac (5):
>> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>> stalls
>> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>> spinlock stalls
>> r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>> spinlocks
>> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>> spinlock contention
>> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>> spinlocks
>>
>> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>> 1 file changed, 150 insertions(+), 154 deletions(-)
>>

Hi, Mr. Kallweit,

So good to hear so soon from you. I'm encouraged that you are positive about improving
the speed and reducing the size of the Realtek drivers.

> You still write:
> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
> even though you were informed that that's not the case.

I was not convinced. There is no such thing as a free lunch, and there is no locking
without affecting other cores, or locking would not make sense.

> There's no actual problem, therefore your Fixes tags are incorrect.

Mea culpa - my mistake, I will fix that in the next version.

> Also net-next is closed at the moment.

There is no problem with that, as these are only optimisation fixes, not zero day
exploits. I am a patient person.

> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
> patch 2 is worth adding all the helpers in patch 1.

I merely followed and mimed driver style from the constructions like this one:

static const struct ephy_info e_info_8168e_1[] = {
{ 0x00, 0x0200, 0x0100 },
{ 0x00, 0x0000, 0x0004 },
{ 0x06, 0x0002, 0x0001 },
{ 0x06, 0x0000, 0x0030 },
{ 0x07, 0x0000, 0x2000 },
{ 0x00, 0x0000, 0x0020 },
{ 0x03, 0x5800, 0x2000 },
{ 0x03, 0x0000, 0x0001 },
{ 0x01, 0x0800, 0x1000 },
{ 0x07, 0x0000, 0x4000 },
{ 0x1e, 0x0000, 0x2000 },
{ 0x19, 0xffff, 0xfe6c },
{ 0x0a, 0x0000, 0x0040 }
};

rtl_set_def_aspm_entry_latency(tp);

rtl_ephy_init(tp, e_info_8168e_1);

Here you did not think that introducing an array reduced code readability.

My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
on each RTL_W32() write. I am convinced that a driver with less
raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
with more NICs and more cores.

You said nothing to convinced me otherwise.

But I am merely defending my point, this by no means implies disrespect or overlooking
your contribution to the source as a coder and a a maintainer.

Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous,
and it is logical to use less locking, as locking is expensive. "barrier" in writev()
guarantees sequential orders of write, and locking and unlocking on each read/modify/write
is unnecessary overhead, IMHO.

As the conclusion, I would like to emphasise that improving lock contention for the code
is by no means a personal attack on the maintainer or a breach of the Code of Conduct.

If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
maybe we can cool emotions and start thinking rationally.

Additionally, I would like to "inline" many functions, as I think that call/return
sequences with stack frame generation /destruction are more expensive than inlining the
small one liners.

But I will certainly respect your opinion on the matter as a maintainer.

What I realise that I might be optimising the cold paths of the code, but from your emails
it seems like nothing is worth optimising in this driver, and with all due respect Sir,
I think that is dead wrong.

Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
reactor that malfunctioned: they did not dare to question the authority telling them that
everything is alright.

Have a nice evening, and please do not take these words as a breach of the Code or a
personal attack. I believe we are on the same side, and that is making this driver better.

The Linux kernel developer community was my last hope that this human race has a force
to improve the mankind and make it worth surviving.

But sometimes it is more honourable to go down with the ship and preserve the honour.

Best regards,
Mirsad Todorovac

2023-11-05 02:45:40

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention



On 11/4/23 23:37, Heiner Kallweit wrote:
> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>> The motivation for these helpers was the locking overhead of 130 consecutive
>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>> if the PHY is powered-down.
>>
>> To quote Heiner:
>>
>> On RTL8411b the RX unit gets confused if the PHY is powered-down.
>> This was reported in [0] and confirmed by Realtek. Realtek provided
>> a sequence to fix the RX unit after PHY wakeup.
>>
>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Each mac ocp write is made of:
>>
>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>> u32 data)
>> {
>> if (rtl_ocp_reg_failure(reg))
>> return;
>>
>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>> }
>>
>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>> u32 data)
>> {
>> unsigned long flags;
>>
>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> __r8168_mac_ocp_write(tp, reg, data);
>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>> }
>>
>> Register programming is done through RTL_W32() macro which expands into
>>
>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>
>> which is further (on Alpha):
>>
>> extern inline void writel(u32 b, volatile void __iomem *addr)
>> {
>> mb();
>> __raw_writel(b, addr);
>> }
>>
>> or on i386/x86_64:
>>
>> #define build_mmio_write(name, size, type, reg, barrier) \
>> static inline void name(type val, volatile void __iomem *addr) \
>> { asm volatile("mov" size " %0,%1": :reg (val), \
>> "m" (*(volatile type __force *)addr) barrier); }
>>
>> build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>
>> This obviously involves iat least a compiler barrier.
>>
>> mb() expands into something like this i.e. on x86_64:
>>
>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>
>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>> memory barrier, writel(), and spin_unlock_irqrestore().
>>
>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>> for certain time I/O takes to finish.
>>
>> In a sequential case of RTL register programming, the writes to RTL registers
>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>> number of bus stalls in a multicore or multi-CPU system.
>>
>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>> provided to reduce lock contention:
>>
>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>> {
>>
>> ...
>>
>> /* The following Realtek-provided magic fixes an issue with the RX unit
>> * getting confused after the PHY having been powered-down.
>> */
>>
>> static const struct recover_8411b_info init_zero_seq[] = {
>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>> ...
>> };
>>
>> ...
>>
>> r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>
>> ...
>>
>> }
>>
>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>> functions that only changed the function names and the ending of the line, so the actual
>> hex data is unchanged.
>>
>> To repeat, the reason for the introduction of the original commit
>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>> the programming sequence to reach RTL NIC registers.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>
>> v6:
>> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>
>> v5:
>> attempted some new optimisations, which were rejected, but not all and not completely.
>>
>> v4:
>> fixed complaints as advised by Heiner and checkpatch.pl.
>> split the patch into five sections to be more easily manipulated and reviewed
>> introduced r8168_mac_ocp_write_seq()
>> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>
>> v3:
>> removed register/mask pair array sentinels, so using ARRAY_SIZE().
>> avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>> Mirsad Goran Todorovac (5):
>> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>> stalls
>> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>> spinlock stalls
>> r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>> spinlocks
>> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>> spinlock contention
>> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>> spinlocks
>>
>> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>> 1 file changed, 150 insertions(+), 154 deletions(-)
>>
>
> You still write:
> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
> even though you were informed that that's not the case.
> There's no actual problem, therefore your Fixes tags are incorrect.
> Also net-next is closed at the moment.
> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
> patch 2 is worth adding all the helpers in patch 1.

After some thought, I would like to have a consensus on these patches, rather than someone
feels defeated or outvoted.

So I will try to reach some common ground, if you think the cause is worth it.

Why is adding six lines of a helper a problem worse than removing 130 lines of callers?

I would hate to think that the Linux kernel developer community became the place where
Authority has higher weight than Reason and Logic.

I have no personal gain from improving these drivers other than the Galactic credits.

One thing I wouldn't like and do not like is the Windows drivers being better because
their programmers are more innovative.

Best regards,
Mirsad Todorovac

2023-11-05 09:21:17

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

On 05.11.2023 01:15, Mirsad Todorovac wrote:
>
>
> On 11/4/23 23:37, Heiner Kallweit wrote:
>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>>> The motivation for these helpers was the locking overhead of 130 consecutive
>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>>> if the PHY is powered-down.
>>>
>>> To quote Heiner:
>>>
>>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>>      a sequence to fix the RX unit after PHY wakeup.
>>>
>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>>> spin_unlock_irqrestore().
>>>
>>> Each mac ocp write is made of:
>>>
>>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                        u32 data)
>>>      {
>>>          if (rtl_ocp_reg_failure(reg))
>>>              return;
>>>
>>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>      }
>>>
>>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                      u32 data)
>>>      {
>>>          unsigned long flags;
>>>
>>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>          __r8168_mac_ocp_write(tp, reg, data);
>>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>      }
>>>
>>> Register programming is done through RTL_W32() macro which expands into
>>>
>>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>
>>> which is further (on Alpha):
>>>
>>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>>      {
>>>          mb();
>>>          __raw_writel(b, addr);
>>>      }
>>>
>>> or on i386/x86_64:
>>>
>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>
>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>
>>> This obviously involves iat least a compiler barrier.
>>>
>>> mb() expands into something like this i.e. on x86_64:
>>>
>>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>
>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>
>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>> for certain time I/O takes to finish.
>>>
>>> In a sequential case of RTL register programming, the writes to RTL registers
>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>> number of bus stalls in a multicore or multi-CPU system.
>>>
>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>>> provided to reduce lock contention:
>>>
>>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>      {
>>>
>>>          ...
>>>
>>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>>           * getting confused after the PHY having been powered-down.
>>>           */
>>>
>>>          static const struct recover_8411b_info init_zero_seq[] = {
>>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>>              ...
>>>          };
>>>
>>>          ...
>>>
>>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>
>>>          ...
>>>
>>>      }
>>>
>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>> functions that only changed the function names and the ending of the line, so the actual
>>> hex data is unchanged.
>>>
>>> To repeat, the reason for the introduction of the original commit
>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>> the programming sequence to reach RTL NIC registers.
>>>
>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>
>>> v6:
>>>   proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>>   the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>>
>>> v5:
>>>   attempted some new optimisations, which were rejected, but not all and not completely.
>>>
>>> v4:
>>>   fixed complaints as advised by Heiner and checkpatch.pl.
>>>   split the patch into five sections to be more easily manipulated and reviewed
>>>   introduced r8168_mac_ocp_write_seq()
>>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>
>>> v3:
>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>> Mirsad Goran Todorovac (5):
>>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>>      stalls
>>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>>      spinlock stalls
>>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>>      spinlocks
>>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>>      spinlock contention
>>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>>      spinlocks
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>>   1 file changed, 150 insertions(+), 154 deletions(-)
>>>
>
> Hi, Mr. Kallweit,
>
> So good to hear so soon from you. I'm encouraged that you are positive about improving
> the speed and reducing the size of the Realtek drivers.
>
>> You still write:
>> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
>> even though you were informed that that's not the case.
>
> I was not convinced. There is no such thing as a free lunch, and there is no locking
> without affecting other cores, or locking would not make sense.
>
>> There's no actual problem, therefore your Fixes tags are incorrect.
>
> Mea culpa - my mistake, I will fix that in the next version.
>
>> Also net-next is closed at the moment.
>
> There is no problem with that, as these are only optimisation fixes, not zero day
> exploits. I am a patient person.
>
>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
>> patch 2 is worth adding all the helpers in patch 1.
>
> I merely followed and mimed driver style from the constructions like this one:
>
>         static const struct ephy_info e_info_8168e_1[] = {
>                 { 0x00, 0x0200, 0x0100 },
>                 { 0x00, 0x0000, 0x0004 },
>                 { 0x06, 0x0002, 0x0001 },
>                 { 0x06, 0x0000, 0x0030 },
>                 { 0x07, 0x0000, 0x2000 },
>                 { 0x00, 0x0000, 0x0020 },
>                 { 0x03, 0x5800, 0x2000 },
>                 { 0x03, 0x0000, 0x0001 },
>                 { 0x01, 0x0800, 0x1000 },
>                 { 0x07, 0x0000, 0x4000 },
>                 { 0x1e, 0x0000, 0x2000 },
>                 { 0x19, 0xffff, 0xfe6c },
>                 { 0x0a, 0x0000, 0x0040 }
>         };
>
>         rtl_set_def_aspm_entry_latency(tp);
>
>         rtl_ephy_init(tp, e_info_8168e_1);
>
> Here you did not think that introducing an array reduced code readability.
>
> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
> on each RTL_W32() write. I am convinced that a driver with less
> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
> with more NICs and more cores.
>
Then please focus on hot paths where it actually could make a difference,
and provide numbers instead of a purely theoretical discussion.

> You said nothing to convinced me otherwise.
>
> But I am merely defending my point, this by no means implies disrespect or overlooking
> your contribution to the source as a coder and a a maintainer.
>
> Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous,
> and it is logical to use less locking, as locking is expensive. "barrier" in writev()
> guarantees sequential orders of write, and locking and unlocking on each read/modify/write
> is unnecessary overhead, IMHO.
>
> As the conclusion, I would like to emphasise that improving lock contention for the code
> is by no means a personal attack on the maintainer or a breach of the Code of Conduct.
>
> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
> maybe we can cool emotions and start thinking rationally.
>
> Additionally, I would like to "inline" many functions, as I think that call/return
> sequences with stack frame generation /destruction are more expensive than inlining the
> small one liners.
>

Mainline standard is to let the compiler decide on inlining.

> But I will certainly respect your opinion on the matter as a maintainer.
>
> What I realise that I might be optimising the cold paths of the code, but from your emails
> it seems like nothing is worth optimising in this driver, and with all due respect Sir,
> I think that is dead wrong.
>

Nobody ever said that, and if you look at the history of the driver you'll see a lot of
optimizations that have been added over time. Ideally an optimization improves both:
performance and code readability
Code readability is important for maintainability and weighs higher for me than a minor
performance optimization in a code path that is very rarely used.

> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
> reactor that malfunctioned: they did not dare to question the authority telling them that
> everything is alright.
>
> Have a nice evening, and please do not take these words as a breach of the Code or a
> personal attack. I believe we are on the same side, and that is making this driver better.
>
> The Linux kernel developer community was my last hope that this human race has a force
> to improve the mankind and make it worth surviving.
>
> But sometimes it is more honourable to go down with the ship and preserve the honour.
>
> Best regards,
> Mirsad Todorovac

2023-11-05 15:01:40

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

On 11/5/23 10:20, Heiner Kallweit wrote:
> On 05.11.2023 01:15, Mirsad Todorovac wrote:
>>
>>
>> On 11/4/23 23:37, Heiner Kallweit wrote:
>>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>>>> The motivation for these helpers was the locking overhead of 130 consecutive
>>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>>>> if the PHY is powered-down.
>>>>
>>>> To quote Heiner:
>>>>
>>>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>>>      a sequence to fix the RX unit after PHY wakeup.
>>>>
>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>>>> spin_unlock_irqrestore().
>>>>
>>>> Each mac ocp write is made of:
>>>>
>>>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>>                        u32 data)
>>>>      {
>>>>          if (rtl_ocp_reg_failure(reg))
>>>>              return;
>>>>
>>>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>      }
>>>>
>>>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>>                      u32 data)
>>>>      {
>>>>          unsigned long flags;
>>>>
>>>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>          __r8168_mac_ocp_write(tp, reg, data);
>>>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>      }
>>>>
>>>> Register programming is done through RTL_W32() macro which expands into
>>>>
>>>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>
>>>> which is further (on Alpha):
>>>>
>>>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>      {
>>>>          mb();
>>>>          __raw_writel(b, addr);
>>>>      }
>>>>
>>>> or on i386/x86_64:
>>>>
>>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>>
>>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>
>>>> This obviously involves iat least a compiler barrier.
>>>>
>>>> mb() expands into something like this i.e. on x86_64:
>>>>
>>>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>
>>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>
>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>>> for certain time I/O takes to finish.
>>>>
>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>> number of bus stalls in a multicore or multi-CPU system.
>>>>
>>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>>>> provided to reduce lock contention:
>>>>
>>>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>      {
>>>>
>>>>          ...
>>>>
>>>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>           * getting confused after the PHY having been powered-down.
>>>>           */
>>>>
>>>>          static const struct recover_8411b_info init_zero_seq[] = {
>>>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>>>              ...
>>>>          };
>>>>
>>>>          ...
>>>>
>>>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>
>>>>          ...
>>>>
>>>>      }
>>>>
>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>> functions that only changed the function names and the ending of the line, so the actual
>>>> hex data is unchanged.
>>>>
>>>> To repeat, the reason for the introduction of the original commit
>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>>> the programming sequence to reach RTL NIC registers.
>>>>
>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>>
>>>> v6:
>>>>   proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>>>   the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>>>
>>>> v5:
>>>>   attempted some new optimisations, which were rejected, but not all and not completely.
>>>>
>>>> v4:
>>>>   fixed complaints as advised by Heiner and checkpatch.pl.
>>>>   split the patch into five sections to be more easily manipulated and reviewed
>>>>   introduced r8168_mac_ocp_write_seq()
>>>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>>
>>>> v3:
>>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>>
>>>> Mirsad Goran Todorovac (5):
>>>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>>>      stalls
>>>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>>>      spinlock stalls
>>>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>>>      spinlocks
>>>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>>>      spinlock contention
>>>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>>>      spinlocks
>>>>
>>>>   drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>>>   1 file changed, 150 insertions(+), 154 deletions(-)
>>>>
>>
>> Hi, Mr. Kallweit,
>>
>> So good to hear so soon from you. I'm encouraged that you are positive about improving
>> the speed and reducing the size of the Realtek drivers.
>>
>>> You still write:
>>> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
>>> even though you were informed that that's not the case.
>>
>> I was not convinced. There is no such thing as a free lunch, and there is no locking
>> without affecting other cores, or locking would not make sense.
>>
>>> There's no actual problem, therefore your Fixes tags are incorrect.
>>
>> Mea culpa - my mistake, I will fix that in the next version.
>>
>>> Also net-next is closed at the moment.
>>
>> There is no problem with that, as these are only optimisation fixes, not zero day
>> exploits. I am a patient person.
>>
>>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
>>> patch 2 is worth adding all the helpers in patch 1.
>>
>> I merely followed and mimed driver style from the constructions like this one:
>>
>>         static const struct ephy_info e_info_8168e_1[] = {
>>                 { 0x00, 0x0200, 0x0100 },
>>                 { 0x00, 0x0000, 0x0004 },
>>                 { 0x06, 0x0002, 0x0001 },
>>                 { 0x06, 0x0000, 0x0030 },
>>                 { 0x07, 0x0000, 0x2000 },
>>                 { 0x00, 0x0000, 0x0020 },
>>                 { 0x03, 0x5800, 0x2000 },
>>                 { 0x03, 0x0000, 0x0001 },
>>                 { 0x01, 0x0800, 0x1000 },
>>                 { 0x07, 0x0000, 0x4000 },
>>                 { 0x1e, 0x0000, 0x2000 },
>>                 { 0x19, 0xffff, 0xfe6c },
>>                 { 0x0a, 0x0000, 0x0040 }
>>         };
>>
>>         rtl_set_def_aspm_entry_latency(tp);
>>
>>         rtl_ephy_init(tp, e_info_8168e_1);
>>
>> Here you did not think that introducing an array reduced code readability.
>>
>> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
>> on each RTL_W32() write. I am convinced that a driver with less
>> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
>> with more NICs and more cores.
>>
> Then please focus on hot paths where it actually could make a difference,
> and provide numbers instead of a purely theoretical discussion.

I will comply.

RTL8411b losing PHY that requires this expensive reset probably doesn't happen
anyway on the Linux servers. :-/

I have done my homework and I see that you are also co-maintainer of the net PHYLIB,
so your insight on this matter is undoubtedly greater after five years of experience
in maintaining the driver.

Learning about the network stack and the PHY layer is however a formidable thought
very interesting task. The whole area of making multimedia more responsive on Linux
and Windows graphic interface is very challenging, and I could pass it with my day
job as research.

But as I said, I have to catch up with a lot of homework.

>> You said nothing to convinced me otherwise.
>>
>> But I am merely defending my point, this by no means implies disrespect or overlooking
>> your contribution to the source as a coder and a a maintainer.
>>
>> Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous,
>> and it is logical to use less locking, as locking is expensive. "barrier" in writev()
>> guarantees sequential orders of write, and locking and unlocking on each read/modify/write
>> is unnecessary overhead, IMHO.
>>
>> As the conclusion, I would like to emphasise that improving lock contention for the code
>> is by no means a personal attack on the maintainer or a breach of the Code of Conduct.
>>
>> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
>> maybe we can cool emotions and start thinking rationally.
>>
>> Additionally, I would like to "inline" many functions, as I think that call/return
>> sequences with stack frame generation /destruction are more expensive than inlining the
>> small one liners.

> Mainline standard is to let the compiler decide on inlining.

>> But I will certainly respect your opinion on the matter as a maintainer.
>>
>> What I realise that I might be optimising the cold paths of the code, but from your emails
>> it seems like nothing is worth optimising in this driver, and with all due respect Sir,
>> I think that is dead wrong.
>
> Nobody ever said that, and if you look at the history of the driver you'll see a lot of
> optimizations that have been added over time. Ideally an optimization improves both:
> performance and code readability
> Code readability is important for maintainability and weighs higher for me than a minor
> performance optimization in a code path that is very rarely used.

I see.

However, you do use lookup tables for programming with fn rtl_ephy_init().

If this would be more readable, I can unroll the table so it is one entry per line
like e_info_8168e_1 was made?

Then the actual function call adds nothing to readability and the ease of maintanance,
as the principle { address, value } and { address, mask, value } would be preserved.

In fact, some programming books advise separating data from code for readability and
efficiency sake.

I admit that the 8125 optimisation I proposed is minimal locking but hard to read and
maintain. (It defies the KISS principle.)

Thank you for your time and patience with me.

I always like more that things are explained through Spock logic than by a call to authority
(which is BTW the argumentum-ad-hominem logical fallacy).

(This is of course not the case with the sacred texts, where I use quotes from the relevant
authorities.)

Have a nice day and I wish you a blessed Sunday.

I should probably catch up with the documentation before using any more of your valuable time
and energy. I hope to reciprocate.

Best regards,
Mirsad Todorovac

>> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
>> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
>> reactor that malfunctioned: they did not dare to question the authority telling them that
>> everything is alright.
>>
>> Have a nice evening, and please do not take these words as a breach of the Code or a
>> personal attack. I believe we are on the same side, and that is making this driver better.
>>
>> The Linux kernel developer community was my last hope that this human race has a force
>> to improve the mankind and make it worth surviving.
>>
>> But sometimes it is more honourable to go down with the ship and preserve the honour.
>>
>> Best regards,
>> Mirsad Todorovac

2023-11-05 15:40:08

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

> > > With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> > > a lock storm that will stall all of the cores and CPUs on the same memory controller
> > > for certain time I/O takes to finish.

Please provide benchmark data to show this is a real issue, and the
patch fixes it.

> Additionally, I would like to "inline" many functions, as I think that call/return
> sequences with stack frame generation /destruction are more expensive than inlining the
> small one liners.

Please provide benchmarks to show the compiler is getting this wrong,
and inline really is needed.

Until there are benchmarks: NACK.

Andrew

2023-11-05 19:28:22

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

On 11/5/23 16:33, Andrew Lunn wrote:

>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>>> for certain time I/O takes to finish.

> Please provide benchmark data to show this is a real issue, and the
> patch fixes it.

Certainly, Sir, but I have to figure out what to measure.

To better think of it, I actually do not have a system with a physical RTL8411b, this patch
was made by the finding in the visual inspection of the code.

FWIW, separation of code and data is the design principle that is strongly endorsed lately
and it seems like a good practice that prevents a range of security breaches and attacks:

[1] https://blog.klipse.tech/databook/2020/10/02/separate-code-data.html

[2] Reduce system complexity by separating Code from Data,
https://livebook.manning.com/book/data-oriented-programming/chapter-2/v-2/

In this case, data is hard-coded.

The resulting code would be smaller in size and execution time, and IMHO more readable,
(in a table), but I will not enter much discussion for you have made your mind already.

>> Additionally, I would like to "inline" many functions, as I think that call/return
>> sequences with stack frame generation /destruction are more expensive than inlining the
>> small one liners.
>
> Please provide benchmarks to show the compiler is getting this wrong,
> and inline really is needed.

I think I am by now technologically up to that:

"drivers/net/ethernet/realtek/r8169_main.s" 302034 lines
-------------------------------------------------------------------------------------
7500 r8168_mac_ocp_write:
7501 .LVL488:
7502 .LFB6322:
7503 .loc 1 900 1 is_stmt 1 view -0
7504 .cfi_startproc
7505 1: call __fentry__
7506 .section __mcount_loc, "a",@progbits
7507 .quad 1b
7508 .previous
7509 .loc 1 901 2 view .LVU1955
7510 .loc 1 903 2 view .LVU1956
7511 .loc 1 903 7 view .LVU1957
7512 .loc 1 903 10 view .LVU1958
7513 .loc 1 903 33 view .LVU1959
7514 .loc 1 903 57 view .LVU1960
7515 .loc 1 903 88 view .LVU1961
7516 .loc 1 903 95 view .LVU1962
7517 .loc 1 900 1 is_stmt 0 view .LVU1963
7518 pushq %rbp
7519 .cfi_def_cfa_offset 16
7520 .cfi_offset 6, -16
7521 movq %rsp, %rbp
7522 .cfi_def_cfa_register 6
7523 pushq %r15
7524 .cfi_offset 15, -24
7525 .loc 1 903 103 view .LVU1964
7526 leaq 6692(%rdi), %r15
7527 .loc 1 900 1 view .LVU1965
7528 pushq %r14
7529 .cfi_offset 14, -32
7530 movl %edx, %r14d
7531 pushq %r13
7532 pushq %r12
7533 .cfi_offset 13, -40
7534 .cfi_offset 12, -48
7535 movq %rdi, %r12
7536 .loc 1 903 103 view .LVU1966
7537 movq %r15, %rdi
7538 .LVL489:
7539 .loc 1 900 1 view .LVU1967
7540 pushq %rbx
7541 .cfi_offset 3, -56
7542 .loc 1 900 1 view .LVU1968
7543 movl %esi, %ebx
7544 .loc 1 903 103 view .LVU1969
7545 call _raw_spin_lock_irqsave
7546 .LVL490:
7547 .LBB3557:
7548 .LBB3558:
7549 .loc 1 893 6 view .LVU1970
7550 movl %ebx, %edi
7551 .LBE3558:
7552 .LBE3557:
7553 .loc 1 903 103 view .LVU1971
7554 movq %rax, %r13
7555 .LVL491:
7556 .loc 1 903 5 is_stmt 1 view .LVU1972
7557 .loc 1 904 2 view .LVU1973
7558 .LBB3564:
7559 .LBI3557:
7560 .loc 1 891 13 view .LVU1974
7561 .LBB3563:
7562 .loc 1 893 2 view .LVU1975
7563 .loc 1 893 6 is_stmt 0 view .LVU1976
7564 call rtl_ocp_reg_failure
7565 .LVL492:
7566 .loc 1 893 5 view .LVU1977
7567 testb %al, %al
7568 jne .L375
7569 .LVL493:
7570 .LBB3559:
7571 .LBI3559:
7572 .loc 1 891 13 is_stmt 1 view .LVU1978
7573 .LBB3560:
7574 .loc 1 896 2 view .LVU1979
7575 .loc 1 896 28 is_stmt 0 view .LVU1980
7576 sall $15, %ebx
7577 .LVL494:
7578 .loc 1 896 58 view .LVU1981
7579 movq (%r12), %rax
7580 .loc 1 896 35 view .LVU1982
7581 orl %r14d, %ebx
7582 .loc 1 896 2 view .LVU1983
7583 orl $-2147483648, %ebx
7584 .LVL495:
7585 .LBB3561:
7586 .LBI3561:
7587 .loc 2 66 120 is_stmt 1 view .LVU1984
7588 .LBB3562:
7589 .loc 2 66 168 view .LVU1985
7590 #APP
7591 # 66 "./arch/x86/include/asm/io.h" 1
7592 movl %ebx,176(%rax)
7593 # 0 "" 2
7594 .LVL496:
7595 #NO_APP
7596 .L375:
7597 .loc 2 66 168 is_stmt 0 view .LVU1986
7598 .LBE3562:
7599 .LBE3561:
7600 .LBE3560:
7601 .LBE3559:
7602 .LBE3563:
7603 .LBE3564:
7604 .loc 1 905 2 is_stmt 1 view .LVU1987
7605 .loc 1 905 7 view .LVU1988
7606 .loc 1 905 10 view .LVU1989
7607 .loc 1 905 33 view .LVU1990
7608 .loc 1 905 57 view .LVU1991
7609 .loc 1 905 88 view .LVU1992
7610 .loc 1 905 95 view .LVU1993
7611 movq %r13, %rsi
7612 movq %r15, %rdi
7613 call _raw_spin_unlock_irqrestore
7614 .LVL497:
7615 .loc 1 905 5 view .LVU1994
7616 .loc 1 906 1 is_stmt 0 view .LVU1995
7617 popq %rbx
7618 .cfi_restore 3
7619 popq %r12
7620 .cfi_restore 12
7621 .LVL498:
7622 .loc 1 906 1 view .LVU1996
7623 popq %r13
7624 .cfi_restore 13
7625 .LVL499:
7626 .loc 1 906 1 view .LVU1997
7627 popq %r14
7628 .cfi_restore 14
7629 .LVL500:
7630 .loc 1 906 1 view .LVU1998
7631 popq %r15
7632 .cfi_restore 15
7633 .LVL501:
7634 .loc 1 906 1 view .LVU1999
7635 popq %rbp
7636 .cfi_restore 6
7637 .cfi_def_cfa 7, 8
7638 xorl %eax, %eax
7639 xorl %edx, %edx
7640 xorl %esi, %esi
7641 xorl %edi, %edi
7642 jmp __x86_return_thunk
7643 .cfi_endproc
7644 .LFE6322:
7645 .size r8168_mac_ocp_write, .-r8168_mac_ocp_write
7646 .p2align 4
7647 .section __patchable_function_entries,"awo",@progbits,rtl_eriar_cond_check
7648 .align 8
7649 .quad .LPFE44
7650 .text
7651 .LPFE44:
7652 nop
7653 nop
7654 nop
7655 nop
-------------------------------------------------------------------------------------

The call of the function is the actual call:

-------------------------------------------------------------------------------------
39334 .LBE11119:
39335 .loc 1 3112 2 is_stmt 1 view .LVU10399
39336 xorl %edx, %edx
39337 movl $64556, %esi
39338 movq %r13, %rdi
39339 call r8168_mac_ocp_write
-------------------------------------------------------------------------------------

The command used for generating the assembly was taken from .o.cmd file and
added -save-temps as the only change:

$ gcc -Wp,-MMD,drivers/net/ethernet/realtek/.r8169_main.o.d -save-temps -nostdinc \
-I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi \
-I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi \
-include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h \
-include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= \
-std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing \
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 \
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup \
-mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables \
-mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix \
-mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 \
-fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong \
-fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection \
-fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 \
-fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration \
-Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs \
-Wno-frame-address -Wno-address-of-packed-member -Wframe-larger-than=1024 -Wno-main \
-Wno-unused-but-set-variable -Wno-unused-const-variable -Wvla -Wno-pointer-sign -Wcast-function-type \
-Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time \
-Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wno-unused-but-set-variable \
-Wno-unused-const-variable -Wno-restrict -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation \
-Wno-stringop-overflow -Wno-stringop-truncation -Wno-missing-field-initializers -Wno-type-limits \
-Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -g -gdwarf-5 -fsanitize=bounds-strict \
-fsanitize=shift -fsanitize=bool -fsanitize=enum -DMODULE -DKBUILD_BASENAME='"r8169_main"' \
-DKBUILD_MODNAME='"r8169"' -D__KBUILD_MODNAME=kmod_r8169 -c -o drivers/net/ethernet/realtek/r8169_main.o \
drivers/net/ethernet/realtek/r8169_main.c ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr \
--hacks=skylake --retpoline --rethunk --sls --stackval --static-call --uaccess --prefix=16 \
--module drivers/net/ethernet/realtek/r8169_main.o

This is a build against net-next. Please find the attached config.

RTL_(R|W)(8|16|32) family are obviously macros:

#define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

This is the writel():

7590 #APP
7591 # 66 "./arch/x86/include/asm/io.h" 1
7592 movl %ebx,176(%rax)
7593 # 0 "" 2
7594 .LVL496:
7595 #NO_APP

writel() looks optimal.

Hope this helps.

> Until there are benchmarks: NACK.

That means I've got a Reviewed-by: Jacob Keller and two NACKs.

I am voted out. :-/

I suppose one NACK from a maintainer is sufficient to halt the patch.

Going back to the documentation, the drawing board, and of course, the Source. :-)

Best regards,
Mirsad Todorovac


Attachments:
net-next-config.xz (57.09 kB)

2023-11-05 20:36:48

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

> The command used for generating the assembly was taken from .o.cmd file and
> added -save-temps as the only change:

make drivers/net/ethernet/realtek/r8169_main.lst

is simpler. You get the C and the generated assembler listed together.

Andrew

2023-11-06 19:27:21

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention



On 11/5/23 21:36, Andrew Lunn wrote:
>> The command used for generating the assembly was taken from .o.cmd file and
>> added -save-temps as the only change:
>
> make drivers/net/ethernet/realtek/r8169_main.lst
>
> is simpler. You get the C and the generated assembler listed together.
>
> Andrew

Here, Sir:

----------------------------------------------------------------------------------------------
0000000000001950 <r8168_mac_ocp_write>:
{
1950: e8 00 00 00 00 call 1955 <r8168_mac_ocp_write+0x5>
1951: R_X86_64_PLT32 __fentry__-0x4
1955: 55 push %rbp
1956: 48 89 e5 mov %rsp,%rbp
1959: 41 57 push %r15
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
195b: 4c 8d bf 24 1a 00 00 lea 0x1a24(%rdi),%r15
{
1962: 41 56 push %r14
1964: 41 89 d6 mov %edx,%r14d
1967: 41 55 push %r13
1969: 41 54 push %r12
196b: 49 89 fc mov %rdi,%r12
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
196e: 4c 89 ff mov %r15,%rdi
{
1971: 53 push %rbx
1972: 89 f3 mov %esi,%ebx
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
1974: e8 00 00 00 00 call 1979 <r8168_mac_ocp_write+0x29>
1975: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
if (rtl_ocp_reg_failure(reg))
1979: 89 df mov %ebx,%edi
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
197b: 49 89 c5 mov %rax,%r13
if (rtl_ocp_reg_failure(reg))
197e: e8 1d f2 ff ff call ba0 <rtl_ocp_reg_failure>
1983: 84 c0 test %al,%al
1985: 75 16 jne 199d <r8168_mac_ocp_write+0x4d>
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
1987: c1 e3 0f shl $0xf,%ebx
198a: 49 8b 04 24 mov (%r12),%rax
198e: 44 09 f3 or %r14d,%ebx
1991: 81 cb 00 00 00 80 or $0x80000000,%ebx
build_mmio_write(writel, "l", unsigned int, "r", :"memory")
1997: 89 98 b0 00 00 00 mov %ebx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
199d: 4c 89 ee mov %r13,%rsi
19a0: 4c 89 ff mov %r15,%rdi
19a3: e8 00 00 00 00 call 19a8 <r8168_mac_ocp_write+0x58>
19a4: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
----------------------------------------------------------------------------------------------

Actually, this time rtl_hw_start_8411_2() has r8169_mac_ocp_write mostly() inlined
without an explicit "inline".

I cannot explain why the r8169_main.s looked different with the same .config.

----------------------------------------------------------------------------------------------
0000000000008370 <rtl_hw_start_8411_2>:
{
8370: e8 00 00 00 00 call 8375 <rtl_hw_start_8411_2+0x5>
8371: R_X86_64_PLT32 __fentry__-0x4
8375: 55 push %rbp
8376: 48 89 e5 mov %rsp,%rbp
8379: 41 56 push %r14
837b: 41 55 push %r13
837d: 49 89 fd mov %rdi,%r13
8380: 41 54 push %r12
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
8382: 4d 8d a5 24 1a 00 00 lea 0x1a24(%r13),%r12
rtl_hw_start_8168g(tp);
8389: e8 02 c7 ff ff call 4a90 <rtl_hw_start_8168g>
rtl_ephy_init(tp, e_info_8411_2);
838e: ba 0a 00 00 00 mov $0xa,%edx
8393: 48 c7 c6 00 00 00 00 mov $0x0,%rsi
8396: R_X86_64_32S .rodata+0x440
839a: 4c 89 ef mov %r13,%rdi
839d: e8 be 9d ff ff call 2160 <__rtl_ephy_init>
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
83a2: 4c 89 e7 mov %r12,%rdi
83a5: e8 00 00 00 00 call 83aa <rtl_hw_start_8411_2+0x3a>
83a6: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
83aa: ba 00 00 14 fe mov $0xfe140000,%edx
83af: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
83b2: 49 8b 45 00 mov 0x0(%r13),%rax
83b6: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
83bc: 4c 89 e7 mov %r12,%rdi
83bf: e8 00 00 00 00 call 83c4 <rtl_hw_start_8411_2+0x54>
83c0: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
83c4: 4c 89 e7 mov %r12,%rdi
83c7: e8 00 00 00 00 call 83cc <rtl_hw_start_8411_2+0x5c>
83c8: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
83cc: ba 00 00 15 fe mov $0xfe150000,%edx
83d1: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
83d4: 49 8b 45 00 mov 0x0(%r13),%rax
83d8: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
83de: 4c 89 e7 mov %r12,%rdi
83e1: e8 00 00 00 00 call 83e6 <rtl_hw_start_8411_2+0x76>
83e2: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
83e6: 31 d2 xor %edx,%edx
83e8: be 2c fc 00 00 mov $0xfc2c,%esi
83ed: 4c 89 ef mov %r13,%rdi
83f0: e8 5b 95 ff ff call 1950 <r8168_mac_ocp_write>
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
83f5: 4c 89 e7 mov %r12,%rdi
83f8: e8 00 00 00 00 call 83fd <rtl_hw_start_8411_2+0x8d>
83f9: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
83fd: ba 00 00 17 fe mov $0xfe170000,%edx
8402: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
8405: 49 8b 45 00 mov 0x0(%r13),%rax
8409: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
840f: 4c 89 e7 mov %r12,%rdi
8412: e8 00 00 00 00 call 8417 <rtl_hw_start_8411_2+0xa7>
8413: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
8417: 4c 89 e7 mov %r12,%rdi
841a: e8 00 00 00 00 call 841f <rtl_hw_start_8411_2+0xaf>
841b: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
841f: ba 00 00 18 fe mov $0xfe180000,%edx
8424: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
8427: 49 8b 45 00 mov 0x0(%r13),%rax
842b: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
8431: 4c 89 e7 mov %r12,%rdi
8434: e8 00 00 00 00 call 8439 <rtl_hw_start_8411_2+0xc9>
8435: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
8439: 4c 89 e7 mov %r12,%rdi
843c: e8 00 00 00 00 call 8441 <rtl_hw_start_8411_2+0xd1>
843d: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
8441: ba 00 00 19 fe mov $0xfe190000,%edx
8446: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
8449: 49 8b 45 00 mov 0x0(%r13),%rax
844d: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
8453: 4c 89 e7 mov %r12,%rdi
8456: e8 00 00 00 00 call 845b <rtl_hw_start_8411_2+0xeb>
8457: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
845b: 4c 89 e7 mov %r12,%rdi
845e: e8 00 00 00 00 call 8463 <rtl_hw_start_8411_2+0xf3>
845f: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
8463: ba 00 00 1a fe mov $0xfe1a0000,%edx
8468: 48 89 c6 mov %rax,%rsi
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
846b: 49 8b 45 00 mov 0x0(%r13),%rax
846f: 89 90 b0 00 00 00 mov %edx,0xb0(%rax)
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
8475: 4c 89 e7 mov %r12,%rdi
8478: e8 00 00 00 00 call 847d <rtl_hw_start_8411_2+0x10d>
8479: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
847d: 31 d2 xor %edx,%edx
847f: be 36 fc 00 00 mov $0xfc36,%esi
8484: 4c 89 ef mov %r13,%rdi
8487: e8 c4 94 ff ff call 1950 <r8168_mac_ocp_write>
mdelay(3);
----------------------------------------------------------------------------------------------


With the patch applied, and an explicit "inline" on:

static inline bool rtl_ocp_reg_failure(u32 reg);
static inline void __r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...);
static inline void r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...);

the code looks like:

----------------------------------------------------------------------------------------------
0000000000004d60 <rtl_hw_start_8411_2>:
{
4d60: e8 00 00 00 00 call 4d65 <rtl_hw_start_8411_2+0x5>
4d61: R_X86_64_PLT32 __fentry__-0x4
4d65: 55 push %rbp
4d66: 48 89 e5 mov %rsp,%rbp
4d69: 41 57 push %r15
4d6b: 41 56 push %r14
for (p = array; len--; p++)
4d6d: 49 c7 c6 00 00 00 00 mov $0x0,%r14
4d70: R_X86_64_32S .rodata+0x820
{
4d74: 41 55 push %r13
4d76: 41 54 push %r12
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4d78: 41 bc 28 fc 00 00 mov $0xfc28,%r12d
{
4d7e: 53 push %rbx
4d7f: 48 89 fb mov %rdi,%rbx
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4d82: 4c 8d ab 24 1a 00 00 lea 0x1a24(%rbx),%r13
{
4d89: 48 83 ec 08 sub $0x8,%rsp
rtl_hw_start_8168g(tp);
4d8d: e8 3e fe ff ff call 4bd0 <rtl_hw_start_8168g>
rtl_ephy_init(tp, e_info_8411_2);
4d92: ba 0a 00 00 00 mov $0xa,%edx
4d97: 48 c7 c6 00 00 00 00 mov $0x0,%rsi
4d9a: R_X86_64_32S .rodata+0x860
4d9e: 48 89 df mov %rbx,%rdi
4da1: e8 fa d4 ff ff call 22a0 <__rtl_ephy_init>
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4da6: 4c 89 ef mov %r13,%rdi
4da9: e8 00 00 00 00 call 4dae <rtl_hw_start_8411_2+0x4e>
4daa: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
4dae: 31 c9 xor %ecx,%ecx
4db0: 49 89 c7 mov %rax,%r15
for (p = array; len--; p++)
4db3: eb 07 jmp 4dbc <rtl_hw_start_8411_2+0x5c>
__r8168_mac_ocp_write(tp, p->reg, p->data);
4db5: 41 8b 4e 04 mov 0x4(%r14),%ecx
4db9: 45 8b 26 mov (%r14),%r12d
return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
4dbc: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d
4dc3: 0f 85 9b 01 00 00 jne 4f64 <rtl_hw_start_8411_2+0x204>
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
4dc9: 41 c1 e4 0f shl $0xf,%r12d
4dcd: 48 8b 03 mov (%rbx),%rax
4dd0: 41 09 cc or %ecx,%r12d
4dd3: 41 81 cc 00 00 00 80 or $0x80000000,%r12d
build_mmio_write(writel, "l", unsigned int, "r", :"memory")
4dda: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax)
for (p = array; len--; p++)
4de1: 49 83 c6 08 add $0x8,%r14
4de5: 49 81 fe 00 00 00 00 cmp $0x0,%r14
4de8: R_X86_64_32S .rodata+0x860
4dec: 75 c7 jne 4db5 <rtl_hw_start_8411_2+0x55>
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
4dee: 4c 89 fe mov %r15,%rsi
4df1: 4c 89 ef mov %r13,%rdi
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4df4: 41 bc 00 f8 00 00 mov $0xf800,%r12d
for (p = array; len--; p++)
4dfa: 49 c7 c6 00 00 00 00 mov $0x0,%r14
4dfd: R_X86_64_32S .rodata+0x4a0
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
4e01: e8 00 00 00 00 call 4e06 <rtl_hw_start_8411_2+0xa6>
4e02: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
mdelay(3);
4e06: bf 08 9c c4 00 mov $0xc49c08,%edi
4e0b: e8 00 00 00 00 call 4e10 <rtl_hw_start_8411_2+0xb0>
4e0c: R_X86_64_PLT32 __const_udelay-0x4
r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
4e10: 31 d2 xor %edx,%edx
4e12: be 26 fc 00 00 mov $0xfc26,%esi
4e17: 48 89 df mov %rbx,%rdi
4e1a: e8 71 d0 ff ff call 1e90 <r8168_mac_ocp_write>
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4e1f: 4c 89 ef mov %r13,%rdi
4e22: e8 00 00 00 00 call 4e27 <rtl_hw_start_8411_2+0xc7>
4e23: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
4e27: b9 08 e0 00 00 mov $0xe008,%ecx
4e2c: 49 89 c7 mov %rax,%r15
for (p = array; len--; p++)
4e2f: eb 07 jmp 4e38 <rtl_hw_start_8411_2+0xd8>
__r8168_mac_ocp_write(tp, p->reg, p->data);
4e31: 41 8b 4e 04 mov 0x4(%r14),%ecx
4e35: 45 8b 26 mov (%r14),%r12d
return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
4e38: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d
4e3f: 0f 85 eb 00 00 00 jne 4f30 <rtl_hw_start_8411_2+0x1d0>
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
4e45: 41 c1 e4 0f shl $0xf,%r12d
4e49: 48 8b 03 mov (%rbx),%rax
4e4c: 41 09 cc or %ecx,%r12d
4e4f: 41 81 cc 00 00 00 80 or $0x80000000,%r12d
4e56: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax)
for (p = array; len--; p++)
4e5d: 49 83 c6 08 add $0x8,%r14
4e61: 49 81 fe 00 00 00 00 cmp $0x0,%r14
4e64: R_X86_64_32S .rodata+0x818
4e68: 75 c7 jne 4e31 <rtl_hw_start_8411_2+0xd1>
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
4e6a: 4c 89 fe mov %r15,%rsi
4e6d: 4c 89 ef mov %r13,%rdi
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4e70: 41 bc 2a fc 00 00 mov $0xfc2a,%r12d
for (p = array; len--; p++)
4e76: 49 c7 c6 00 00 00 00 mov $0x0,%r14
4e79: R_X86_64_32S .rodata+0x460
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
4e7d: e8 00 00 00 00 call 4e82 <rtl_hw_start_8411_2+0x122>
4e7e: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
4e82: ba 00 80 00 00 mov $0x8000,%edx
4e87: be 26 fc 00 00 mov $0xfc26,%esi
4e8c: 48 89 df mov %rbx,%rdi
4e8f: e8 fc cf ff ff call 1e90 <r8168_mac_ocp_write>
raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
4e94: 4c 89 ef mov %r13,%rdi
4e97: e8 00 00 00 00 call 4e9c <rtl_hw_start_8411_2+0x13c>
4e98: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4
4e9c: b9 43 07 00 00 mov $0x743,%ecx
4ea1: 49 89 c7 mov %rax,%r15
for (p = array; len--; p++)
4ea4: eb 07 jmp 4ead <rtl_hw_start_8411_2+0x14d>
__r8168_mac_ocp_write(tp, p->reg, p->data);
4ea6: 41 8b 4e 04 mov 0x4(%r14),%ecx
4eaa: 45 8b 26 mov (%r14),%r12d
return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
4ead: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d
4eb4: 75 4d jne 4f03 <rtl_hw_start_8411_2+0x1a3>
RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
4eb6: 41 c1 e4 0f shl $0xf,%r12d
4eba: 48 8b 03 mov (%rbx),%rax
4ebd: 41 09 cc or %ecx,%r12d
4ec0: 41 81 cc 00 00 00 80 or $0x80000000,%r12d
4ec7: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax)
for (p = array; len--; p++)
4ece: 49 83 c6 08 add $0x8,%r14
4ed2: 49 81 fe 00 00 00 00 cmp $0x0,%r14
4ed5: R_X86_64_32S .rodata+0x498
4ed9: 75 cb jne 4ea6 <rtl_hw_start_8411_2+0x146>
raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
4edb: 4c 89 fe mov %r15,%rsi
4ede: 4c 89 ef mov %r13,%rdi
4ee1: e8 00 00 00 00 call 4ee6 <rtl_hw_start_8411_2+0x186>
4ee2: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4
}
4ee6: 48 83 c4 08 add $0x8,%rsp
4eea: 5b pop %rbx
4eeb: 41 5c pop %r12
4eed: 41 5d pop %r13
4eef: 41 5e pop %r14
4ef1: 41 5f pop %r15
4ef3: 5d pop %rbp
4ef4: 31 c0 xor %eax,%eax
4ef6: 31 d2 xor %edx,%edx
4ef8: 31 c9 xor %ecx,%ecx
4efa: 31 f6 xor %esi,%esi
4efc: 31 ff xor %edi,%edi
4efe: e9 00 00 00 00 jmp 4f03 <rtl_hw_start_8411_2+0x1a3>
4eff: R_X86_64_PLT32 __x86_return_thunk-0x4
return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
4f03: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f0a <rtl_hw_start_8411_2+0x1aa>
4f06: R_X86_64_PC32 .data.once-0x2
4f0a: 3c 01 cmp $0x1,%al
4f0c: 0f 87 00 00 00 00 ja 4f12 <rtl_hw_start_8411_2+0x1b2>
4f0e: R_X86_64_PC32 .text.unlikely+0xd0
4f12: a8 01 test $0x1,%al
4f14: 75 b8 jne 4ece <rtl_hw_start_8411_2+0x16e>
4f16: 44 89 e6 mov %r12d,%esi
4f19: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
4f1c: R_X86_64_32S .rodata.str1.1+0x53
4f20: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f27 <rtl_hw_start_8411_2+0x1c7>
4f22: R_X86_64_PC32 .data.once-0x3
4f27: e8 00 00 00 00 call 4f2c <rtl_hw_start_8411_2+0x1cc>
4f28: R_X86_64_PLT32 __warn_printk-0x4
4f2c: 0f 0b ud2
4f2e: eb 9e jmp 4ece <rtl_hw_start_8411_2+0x16e>
4f30: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f37 <rtl_hw_start_8411_2+0x1d7>
4f33: R_X86_64_PC32 .data.once-0x2
4f37: 3c 01 cmp $0x1,%al
4f39: 0f 87 00 00 00 00 ja 4f3f <rtl_hw_start_8411_2+0x1df>
4f3b: R_X86_64_PC32 .text.unlikely+0x106
4f3f: a8 01 test $0x1,%al
4f41: 0f 85 16 ff ff ff jne 4e5d <rtl_hw_start_8411_2+0xfd>
4f47: 44 89 e6 mov %r12d,%esi
4f4a: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
4f4d: R_X86_64_32S .rodata.str1.1+0x53
4f51: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f58 <rtl_hw_start_8411_2+0x1f8>
4f53: R_X86_64_PC32 .data.once-0x3
4f58: e8 00 00 00 00 call 4f5d <rtl_hw_start_8411_2+0x1fd>
4f59: R_X86_64_PLT32 __warn_printk-0x4
4f5d: 0f 0b ud2
4f5f: e9 f9 fe ff ff jmp 4e5d <rtl_hw_start_8411_2+0xfd>
4f64: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f6b <rtl_hw_start_8411_2+0x20b>
4f67: R_X86_64_PC32 .data.once-0x2
4f6b: 3c 01 cmp $0x1,%al
4f6d: 0f 87 00 00 00 00 ja 4f73 <rtl_hw_start_8411_2+0x213>
4f6f: R_X86_64_PC32 .text.unlikely+0xeb
4f73: a8 01 test $0x1,%al
4f75: 0f 85 66 fe ff ff jne 4de1 <rtl_hw_start_8411_2+0x81>
4f7b: 44 89 e6 mov %r12d,%esi
4f7e: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
4f81: R_X86_64_32S .rodata.str1.1+0x53
4f85: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f8c <rtl_hw_start_8411_2+0x22c>
4f87: R_X86_64_PC32 .data.once-0x3
4f8c: e8 00 00 00 00 call 4f91 <rtl_hw_start_8411_2+0x231>
4f8d: R_X86_64_PLT32 __warn_printk-0x4
4f91: 0f 0b ud2
4f93: e9 49 fe ff ff jmp 4de1 <rtl_hw_start_8411_2+0x81>
4f98: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
4f9f: 00
4fa0: 90 nop
4fa1: 90 nop
4fa2: 90 nop
4fa3: 90 nop
4fa4: 90 nop
4fa5: 90 nop
4fa6: 90 nop
4fa7: 90 nop
4fa8: 90 nop
4fa9: 90 nop
4faa: 90 nop
4fab: 90 nop
4fac: 90 nop
4fad: 90 nop
4fae: 90 nop
4faf: 90 nop
----------------------------------------------------------------------------------------------

Best regards,
Mirsad Todorovac