2023-09-27 18:50:52

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]

write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
sch_direct_xmit (net/sched/sch_generic.c:342)
__dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
__ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
ip_finish_output (net/ipv4/ip_output.c:328)
ip_output (net/ipv4/ip_output.c:435)
ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
udp_send_skb (net/ipv4/udp.c:963)
udp_sendmsg (net/ipv4/udp.c:1246)
inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
sock_sendmsg (net/socket.c:730 net/socket.c:753)
__sys_sendto (net/socket.c:2177)
__x64_sys_sendto (net/socket.c:2185)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x002f4815 -> 0x002f4816

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The write side of drivers/net/ethernet/realtek/r8169_main.c is:
==================
4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
4252 smp_wmb();
4253
→ 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
4255
4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
4257 R8169_TX_STOP_THRS,
4258 R8169_TX_START_THRS);

The read side is the function rtl_tx():

4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
4356 int budget)
4357 {
4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
4359 struct sk_buff *skb;
4360
4361 dirty_tx = tp->dirty_tx;
4362
4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
4365 u32 status;
4366
4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
4368 if (status & DescOwn)
4369 break;
4370
4371 skb = tp->tx_skb[entry].skb;
4372 rtl8169_unmap_tx_skb(tp, entry);
4373
4374 if (skb) {
4375 pkts_compl++;
4376 bytes_compl += skb->len;
4377 napi_consume_skb(skb, budget);
4378 }
4379 dirty_tx++;
4380 }
4381
4382 if (tp->dirty_tx != dirty_tx) {
4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
4385
4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
4387 rtl_tx_slots_avail(tp),
4388 R8169_TX_START_THRS);
4389 /*
4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
4391 * too close. Let's kick an extra TxPoll request when a burst
4392 * of start_xmit activity is detected (if it is not detected,
4393 * it is slow enough). -- FR
4394 * If skb is NULL then we come here again once a tx irq is
4395 * triggered after the last fragment is marked transmitted.
4396 */
→ 4397 if (tp->cur_tx != dirty_tx && skb)
4398 rtl8169_doorbell(tp);
4399 }
4400 }

Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
line 4363:

4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {

but the same solution is required for protecting the other access to tp->cur_tx:

→ 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
4398 rtl8169_doorbell(tp);

The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
might have suffered read tearing under some compiler optimisations.

The fix eliminated the KCSAN data-race report for this bug.

It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
and line 4397. This test should certainly not be cached by the compiler in some register
for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
4254 in the meantime.

Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
Cc: Heiner Kallweit <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
---
v1:
the initial patch proposal. fixes the KCSAN warning.

drivers/net/ethernet/realtek/r8169_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 6351a2dc13bc..281aaa851847 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
* If skb is NULL then we come here again once a tx irq is
* triggered after the last fragment is marked transmitted.
*/
- if (tp->cur_tx != dirty_tx && skb)
+ if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
rtl8169_doorbell(tp);
}
}
--
2.34.1


2023-09-27 18:54:52

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH RFC v1 3/3] r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169

race at unknown origin, with read to 0xffff888140d37570 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0xb0000042 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The read side is in

drivers/net/ethernet/realtek/r8169_main.c
=========================================
4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
4356 int budget)
4357 {
4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
4359 struct sk_buff *skb;
4360
4361 dirty_tx = tp->dirty_tx;
4362
4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
4365 u32 status;
4366
→ 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
4368 if (status & DescOwn)
4369 break;
4370
4371 skb = tp->tx_skb[entry].skb;
4372 rtl8169_unmap_tx_skb(tp, entry);
4373
4374 if (skb) {
4375 pkts_compl++;
4376 bytes_compl += skb->len;
4377 napi_consume_skb(skb, budget);
4378 }
4379 dirty_tx++;
4380 }
4381
4382 if (tp->dirty_tx != dirty_tx) {
4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
4385
4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
4387 rtl_tx_slots_avail(tp),
4388 R8169_TX_START_THRS);
4389 /*
4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
4391 * too close. Let's kick an extra TxPoll request when a burst
4392 * of start_xmit activity is detected (if it is not detected,
4393 * it is slow enough). -- FR
4394 * If skb is NULL then we come here again once a tx irq is
4395 * triggered after the last fragment is marked transmitted.
4396 */
4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
4398 rtl8169_doorbell(tp);
4399 }
4400 }

tp->TxDescArray[entry].opts1 is reported to have a data-race and READ_ONCE() fixes
this KCSAN warning.

4366
→ 4367 status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
4368 if (status & DescOwn)
4369 break;
4370

Fixes: ^1da177e4c3f4 ("initial git repository build")
Cc: Heiner Kallweit <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
---
v1:
the initial patch proposal. fixes the KCSAN warning.

drivers/net/ethernet/realtek/r8169_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 81be6085a480..361b90007148 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4364,7 +4364,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
unsigned int entry = dirty_tx % NUM_TX_DESC;
u32 status;

- status = le32_to_cpu(tp->TxDescArray[entry].opts1);
+ status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
if (status & DescOwn)
break;

--
2.34.1

2023-09-27 19:40:14

by Mirsad Todorovac

[permalink] [raw]
Subject: [PATCH RFC v1 2/3] r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1

KCSAN reported the following data-race bug:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169

race at unknown origin, with read to 0xffff888117e43510 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x80003fff -> 0x3402805f

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

drivers/net/ethernet/realtek/r8169_main.c:
==========================================
4429
→ 4430 status = le32_to_cpu(desc->opts1);
4431 if (status & DescOwn)
4432 break;
4433
4434 /* This barrier is needed to keep us from reading
4435 * any other fields out of the Rx descriptor until
4436 * we know the status of DescOwn
4437 */
4438 dma_rmb();
4439
4440 if (unlikely(status & RxRES)) {
4441 if (net_ratelimit())
4442 netdev_warn(dev, "Rx ERROR. status = %08x\n",

Marco Elver explained that dma_rmb() doesn't prevent the compiler to tear up the access to
desc->opts1 which can be written to concurrently. READ_ONCE() should prevent that from
happening:

4429
→ 4430 status = le32_to_cpu(READ_ONCE(desc->opts1));
4431 if (status & DescOwn)
4432 break;
4433

As the consequence of this fix, this KCSAN warning was eliminated.

Fixes: 6202806e7c03a ("r8169: drop member opts1_mask from struct rtl8169_private")
Suggested-by: Marco Elver <[email protected]>
Cc: Heiner Kallweit <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mirsad Goran Todorovac <[email protected]>

Signed-off-by: Mirsad Goran Todorovac <[email protected]>
---
v1:
the initial patch proposal. fixes the KCSAN warning.

drivers/net/ethernet/realtek/r8169_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 281aaa851847..81be6085a480 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4427,7 +4427,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, int budget
dma_addr_t addr;
u32 status;

- status = le32_to_cpu(desc->opts1);
+ status = le32_to_cpu(READ_ONCE(desc->opts1));
if (status & DescOwn)
break;

--
2.34.1

2023-09-28 04:05:56

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
> KCSAN reported the following data-race:
>
> ==================================================================
> BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
>
> write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
> rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
> dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
> sch_direct_xmit (net/sched/sch_generic.c:342)
> __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
> ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
> __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
> ip_finish_output (net/ipv4/ip_output.c:328)
> ip_output (net/ipv4/ip_output.c:435)
> ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
> udp_send_skb (net/ipv4/udp.c:963)
> udp_sendmsg (net/ipv4/udp.c:1246)
> inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
> sock_sendmsg (net/socket.c:730 net/socket.c:753)
> __sys_sendto (net/socket.c:2177)
> __x64_sys_sendto (net/socket.c:2185)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
> read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
> rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
> __napi_poll (net/core/dev.c:6527)
> net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
> __do_softirq (kernel/softirq.c:553)
> __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
> irq_exit_rcu (kernel/softirq.c:647)
> common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
> cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
> call_cpuidle (kernel/sched/idle.c:135)
> do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
> cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
> start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
>
> value changed: 0x002f4815 -> 0x002f4816
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> ==================================================================
>
> The write side of drivers/net/ethernet/realtek/r8169_main.c is:
> ==================
> 4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
> 4252 smp_wmb();
> 4253
> → 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
> 4255
> 4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
> 4257 R8169_TX_STOP_THRS,
> 4258 R8169_TX_START_THRS);
>
> The read side is the function rtl_tx():
>
> 4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
> 4356 int budget)
> 4357 {
> 4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
> 4359 struct sk_buff *skb;
> 4360
> 4361 dirty_tx = tp->dirty_tx;
> 4362
> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
> 4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
> 4365 u32 status;
> 4366
> 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
> 4368 if (status & DescOwn)
> 4369 break;
> 4370
> 4371 skb = tp->tx_skb[entry].skb;
> 4372 rtl8169_unmap_tx_skb(tp, entry);
> 4373
> 4374 if (skb) {
> 4375 pkts_compl++;
> 4376 bytes_compl += skb->len;
> 4377 napi_consume_skb(skb, budget);
> 4378 }
> 4379 dirty_tx++;
> 4380 }
> 4381
> 4382 if (tp->dirty_tx != dirty_tx) {
> 4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
> 4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
> 4385
> 4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
> 4387 rtl_tx_slots_avail(tp),
> 4388 R8169_TX_START_THRS);
> 4389 /*
> 4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
> 4391 * too close. Let's kick an extra TxPoll request when a burst
> 4392 * of start_xmit activity is detected (if it is not detected,
> 4393 * it is slow enough). -- FR
> 4394 * If skb is NULL then we come here again once a tx irq is
> 4395 * triggered after the last fragment is marked transmitted.
> 4396 */
> → 4397 if (tp->cur_tx != dirty_tx && skb)
> 4398 rtl8169_doorbell(tp);
> 4399 }
> 4400 }
>
> Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
> line 4363:
>
> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>
> but the same solution is required for protecting the other access to tp->cur_tx:
>
> → 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
> 4398 rtl8169_doorbell(tp);
>
> The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
> might have suffered read tearing under some compiler optimisations.
>
> The fix eliminated the KCSAN data-race report for this bug.
>
> It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
> and line 4397. This test should certainly not be cached by the compiler in some register
> for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
> 4254 in the meantime.
>

netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
is used in line 4397. I'm not aware of any reported issues with an obvious link to the
potentential issue you describe.
I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
KCSAN happy, why not.

> Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
> Cc: Heiner Kallweit <[email protected]>
> Cc: [email protected]
> Cc: "David S. Miller" <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Jakub Kicinski <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: Marco Elver <[email protected]>
> Cc: [email protected]
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
> ---
> v1:
> the initial patch proposal. fixes the KCSAN warning.
>
> drivers/net/ethernet/realtek/r8169_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 6351a2dc13bc..281aaa851847 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
> * If skb is NULL then we come here again once a tx irq is
> * triggered after the last fragment is marked transmitted.
> */
> - if (tp->cur_tx != dirty_tx && skb)
> + if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
> rtl8169_doorbell(tp);
> }
> }

2023-09-28 16:48:31

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On Wed, 27 Sept 2023 at 21:52, Heiner Kallweit <[email protected]> wrote:
>
> On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
> > KCSAN reported the following data-race:
> >
> > ==================================================================
> > BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
> >
> > write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
> > rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
> > dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
> > sch_direct_xmit (net/sched/sch_generic.c:342)
> > __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
> > ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
> > __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
> > ip_finish_output (net/ipv4/ip_output.c:328)
> > ip_output (net/ipv4/ip_output.c:435)
> > ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
> > udp_send_skb (net/ipv4/udp.c:963)
> > udp_sendmsg (net/ipv4/udp.c:1246)
> > inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
> > sock_sendmsg (net/socket.c:730 net/socket.c:753)
> > __sys_sendto (net/socket.c:2177)
> > __x64_sys_sendto (net/socket.c:2185)
> > do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> >
> > read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
> > rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
> > __napi_poll (net/core/dev.c:6527)
> > net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
> > __do_softirq (kernel/softirq.c:553)
> > __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
> > irq_exit_rcu (kernel/softirq.c:647)
> > common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
> > asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
> > cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
> > cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
> > call_cpuidle (kernel/sched/idle.c:135)
> > do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
> > cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
> > start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
> > secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
> >
> > value changed: 0x002f4815 -> 0x002f4816
> >
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> > Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> > ==================================================================
> >
> > The write side of drivers/net/ethernet/realtek/r8169_main.c is:
> > ==================
> > 4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
> > 4252 smp_wmb();
> > 4253
> > → 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
> > 4255
> > 4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
> > 4257 R8169_TX_STOP_THRS,
> > 4258 R8169_TX_START_THRS);
> >
> > The read side is the function rtl_tx():
> >
> > 4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
> > 4356 int budget)
> > 4357 {
> > 4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
> > 4359 struct sk_buff *skb;
> > 4360
> > 4361 dirty_tx = tp->dirty_tx;
> > 4362
> > 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
> > 4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
> > 4365 u32 status;
> > 4366
> > 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
> > 4368 if (status & DescOwn)
> > 4369 break;
> > 4370
> > 4371 skb = tp->tx_skb[entry].skb;
> > 4372 rtl8169_unmap_tx_skb(tp, entry);
> > 4373
> > 4374 if (skb) {
> > 4375 pkts_compl++;
> > 4376 bytes_compl += skb->len;
> > 4377 napi_consume_skb(skb, budget);
> > 4378 }
> > 4379 dirty_tx++;
> > 4380 }
> > 4381
> > 4382 if (tp->dirty_tx != dirty_tx) {
> > 4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
> > 4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
> > 4385
> > 4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
> > 4387 rtl_tx_slots_avail(tp),
> > 4388 R8169_TX_START_THRS);
> > 4389 /*
> > 4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
> > 4391 * too close. Let's kick an extra TxPoll request when a burst
> > 4392 * of start_xmit activity is detected (if it is not detected,
> > 4393 * it is slow enough). -- FR
> > 4394 * If skb is NULL then we come here again once a tx irq is
> > 4395 * triggered after the last fragment is marked transmitted.
> > 4396 */
> > → 4397 if (tp->cur_tx != dirty_tx && skb)
> > 4398 rtl8169_doorbell(tp);
> > 4399 }
> > 4400 }
> >
> > Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
> > line 4363:
> >
> > 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
> >
> > but the same solution is required for protecting the other access to tp->cur_tx:
> >
> > → 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
> > 4398 rtl8169_doorbell(tp);
> >
> > The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
> > might have suffered read tearing under some compiler optimisations.
> >
> > The fix eliminated the KCSAN data-race report for this bug.
> >
> > It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
> > and line 4397. This test should certainly not be cached by the compiler in some register
> > for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
> > 4254 in the meantime.
> >
>
> netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
> is used in line 4397. I'm not aware of any reported issues with an obvious link to the
> potentential issue you describe.
> I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
> KCSAN happy, why not.

Barries don't protect unmarked accesses from being miscompiled. So the
use of barriers and marked accesses like READ_ONCE() is correct:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
That said, actually encountering a miscompilation depends on
architecture and compiler. Using the right marked accesses just
ensures things don't suddenly break because the compiler decided to be
a little more clever.

> > Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
> > Cc: Heiner Kallweit <[email protected]>
> > Cc: [email protected]
> > Cc: "David S. Miller" <[email protected]>
> > Cc: Eric Dumazet <[email protected]>
> > Cc: Jakub Kicinski <[email protected]>
> > Cc: Paolo Abeni <[email protected]>
> > Cc: Marco Elver <[email protected]>
> > Cc: [email protected]
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Signed-off-by: Mirsad Goran Todorovac <[email protected]>

Acked-by: Marco Elver <[email protected]>

> > ---
> > v1:
> > the initial patch proposal. fixes the KCSAN warning.
> >
> > drivers/net/ethernet/realtek/r8169_main.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> > index 6351a2dc13bc..281aaa851847 100644
> > --- a/drivers/net/ethernet/realtek/r8169_main.c
> > +++ b/drivers/net/ethernet/realtek/r8169_main.c
> > @@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
> > * If skb is NULL then we come here again once a tx irq is
> > * triggered after the last fragment is marked transmitted.
> > */
> > - if (tp->cur_tx != dirty_tx && skb)
> > + if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
> > rtl8169_doorbell(tp);
> > }
> > }
>

2023-10-13 05:02:35

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On 9/28/2023 8:02 AM, Marco Elver wrote:
> On Wed, 27 Sept 2023 at 21:52, Heiner Kallweit <[email protected]> wrote:
>>
>> On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
>>> KCSAN reported the following data-race:
>>>
>>> ==================================================================
>>> BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
>>>
>>> write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
>>> rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
>>> dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
>>> sch_direct_xmit (net/sched/sch_generic.c:342)
>>> __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
>>> ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
>>> __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
>>> ip_finish_output (net/ipv4/ip_output.c:328)
>>> ip_output (net/ipv4/ip_output.c:435)
>>> ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
>>> udp_send_skb (net/ipv4/udp.c:963)
>>> udp_sendmsg (net/ipv4/udp.c:1246)
>>> inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
>>> sock_sendmsg (net/socket.c:730 net/socket.c:753)
>>> __sys_sendto (net/socket.c:2177)
>>> __x64_sys_sendto (net/socket.c:2185)
>>> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>>> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>>>
>>> read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
>>> rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
>>> __napi_poll (net/core/dev.c:6527)
>>> net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
>>> __do_softirq (kernel/softirq.c:553)
>>> __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
>>> irq_exit_rcu (kernel/softirq.c:647)
>>> common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
>>> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
>>> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
>>> cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
>>> call_cpuidle (kernel/sched/idle.c:135)
>>> do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
>>> cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
>>> start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
>>> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
>>>
>>> value changed: 0x002f4815 -> 0x002f4816
>>>
>>> Reported by Kernel Concurrency Sanitizer on:
>>> CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>>> Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>> ==================================================================
>>>
>>> The write side of drivers/net/ethernet/realtek/r8169_main.c is:
>>> ==================
>>> 4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
>>> 4252 smp_wmb();
>>> 4253
>>> → 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
>>> 4255
>>> 4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
>>> 4257 R8169_TX_STOP_THRS,
>>> 4258 R8169_TX_START_THRS);
>>>
>>> The read side is the function rtl_tx():
>>>
>>> 4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>>> 4356 int budget)
>>> 4357 {
>>> 4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
>>> 4359 struct sk_buff *skb;
>>> 4360
>>> 4361 dirty_tx = tp->dirty_tx;
>>> 4362
>>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>>> 4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
>>> 4365 u32 status;
>>> 4366
>>> 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
>>> 4368 if (status & DescOwn)
>>> 4369 break;
>>> 4370
>>> 4371 skb = tp->tx_skb[entry].skb;
>>> 4372 rtl8169_unmap_tx_skb(tp, entry);
>>> 4373
>>> 4374 if (skb) {
>>> 4375 pkts_compl++;
>>> 4376 bytes_compl += skb->len;
>>> 4377 napi_consume_skb(skb, budget);
>>> 4378 }
>>> 4379 dirty_tx++;
>>> 4380 }
>>> 4381
>>> 4382 if (tp->dirty_tx != dirty_tx) {
>>> 4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
>>> 4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
>>> 4385
>>> 4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
>>> 4387 rtl_tx_slots_avail(tp),
>>> 4388 R8169_TX_START_THRS);
>>> 4389 /*
>>> 4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
>>> 4391 * too close. Let's kick an extra TxPoll request when a burst
>>> 4392 * of start_xmit activity is detected (if it is not detected,
>>> 4393 * it is slow enough). -- FR
>>> 4394 * If skb is NULL then we come here again once a tx irq is
>>> 4395 * triggered after the last fragment is marked transmitted.
>>> 4396 */
>>> → 4397 if (tp->cur_tx != dirty_tx && skb)
>>> 4398 rtl8169_doorbell(tp);
>>> 4399 }
>>> 4400 }
>>>
>>> Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
>>> line 4363:
>>>
>>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>>>
>>> but the same solution is required for protecting the other access to tp->cur_tx:
>>>
>>> → 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>>> 4398 rtl8169_doorbell(tp);
>>>
>>> The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
>>> might have suffered read tearing under some compiler optimisations.
>>>
>>> The fix eliminated the KCSAN data-race report for this bug.
>>>
>>> It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
>>> and line 4397. This test should certainly not be cached by the compiler in some register
>>> for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
>>> 4254 in the meantime.
>>>
>>
>> netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
>> is used in line 4397. I'm not aware of any reported issues with an obvious link to the
>> potentential issue you describe.
>> I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
>> KCSAN happy, why not.
>
> Barries don't protect unmarked accesses from being miscompiled. So the
> use of barriers and marked accesses like READ_ONCE() is correct:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
> That said, actually encountering a miscompilation depends on
> architecture and compiler. Using the right marked accesses just
> ensures things don't suddenly break because the compiler decided to be
> a little more clever.
>
>>> Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
>>> Cc: Heiner Kallweit <[email protected]>
>>> Cc: [email protected]
>>> Cc: "David S. Miller" <[email protected]>
>>> Cc: Eric Dumazet <[email protected]>
>>> Cc: Jakub Kicinski <[email protected]>
>>> Cc: Paolo Abeni <[email protected]>
>>> Cc: Marco Elver <[email protected]>
>>> Cc: [email protected]
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>
> Acked-by: Marco Elver <[email protected]>

Hi, Marco,

Does this Acked-by: cover all of the [123]/3 in the patch series?

I guess I should resubmit the patches as the formal ones as patchwork will
not pick up a PATH RFC?

Thanks,
Mirsad Todorovac

>>> ---
>>> v1:
>>> the initial patch proposal. fixes the KCSAN warning.
>>>
>>> drivers/net/ethernet/realtek/r8169_main.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 6351a2dc13bc..281aaa851847 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>>> * If skb is NULL then we come here again once a tx irq is
>>> * triggered after the last fragment is marked transmitted.
>>> */
>>> - if (tp->cur_tx != dirty_tx && skb)
>>> + if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>>> rtl8169_doorbell(tp);
>>> }
>>> }
>>

--
Mirsad Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
tel. +385 (0)1 3711 451
mob. +385 91 57 88 355

2023-10-13 07:22:11

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On Fri, 13 Oct 2023 at 07:01, Mirsad Goran Todorovac
<[email protected]> wrote:
>
> On 9/28/2023 8:02 AM, Marco Elver wrote:
> > On Wed, 27 Sept 2023 at 21:52, Heiner Kallweit <[email protected]> wrote:
> >>
> >> On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
> >>> KCSAN reported the following data-race:
> >>>
> >>> ==================================================================
> >>> BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
> >>>
> >>> write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
> >>> rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
> >>> dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
> >>> sch_direct_xmit (net/sched/sch_generic.c:342)
> >>> __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
> >>> ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
> >>> __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
> >>> ip_finish_output (net/ipv4/ip_output.c:328)
> >>> ip_output (net/ipv4/ip_output.c:435)
> >>> ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
> >>> udp_send_skb (net/ipv4/udp.c:963)
> >>> udp_sendmsg (net/ipv4/udp.c:1246)
> >>> inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
> >>> sock_sendmsg (net/socket.c:730 net/socket.c:753)
> >>> __sys_sendto (net/socket.c:2177)
> >>> __x64_sys_sendto (net/socket.c:2185)
> >>> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> >>> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> >>>
> >>> read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
> >>> rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
> >>> __napi_poll (net/core/dev.c:6527)
> >>> net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
> >>> __do_softirq (kernel/softirq.c:553)
> >>> __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
> >>> irq_exit_rcu (kernel/softirq.c:647)
> >>> common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
> >>> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
> >>> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
> >>> cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
> >>> call_cpuidle (kernel/sched/idle.c:135)
> >>> do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
> >>> cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
> >>> start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
> >>> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
> >>>
> >>> value changed: 0x002f4815 -> 0x002f4816
> >>>
> >>> Reported by Kernel Concurrency Sanitizer on:
> >>> CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> >>> Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> >>> ==================================================================
> >>>
> >>> The write side of drivers/net/ethernet/realtek/r8169_main.c is:
> >>> ==================
> >>> 4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
> >>> 4252 smp_wmb();
> >>> 4253
> >>> → 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
> >>> 4255
> >>> 4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
> >>> 4257 R8169_TX_STOP_THRS,
> >>> 4258 R8169_TX_START_THRS);
> >>>
> >>> The read side is the function rtl_tx():
> >>>
> >>> 4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
> >>> 4356 int budget)
> >>> 4357 {
> >>> 4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
> >>> 4359 struct sk_buff *skb;
> >>> 4360
> >>> 4361 dirty_tx = tp->dirty_tx;
> >>> 4362
> >>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
> >>> 4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
> >>> 4365 u32 status;
> >>> 4366
> >>> 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
> >>> 4368 if (status & DescOwn)
> >>> 4369 break;
> >>> 4370
> >>> 4371 skb = tp->tx_skb[entry].skb;
> >>> 4372 rtl8169_unmap_tx_skb(tp, entry);
> >>> 4373
> >>> 4374 if (skb) {
> >>> 4375 pkts_compl++;
> >>> 4376 bytes_compl += skb->len;
> >>> 4377 napi_consume_skb(skb, budget);
> >>> 4378 }
> >>> 4379 dirty_tx++;
> >>> 4380 }
> >>> 4381
> >>> 4382 if (tp->dirty_tx != dirty_tx) {
> >>> 4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
> >>> 4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
> >>> 4385
> >>> 4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
> >>> 4387 rtl_tx_slots_avail(tp),
> >>> 4388 R8169_TX_START_THRS);
> >>> 4389 /*
> >>> 4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
> >>> 4391 * too close. Let's kick an extra TxPoll request when a burst
> >>> 4392 * of start_xmit activity is detected (if it is not detected,
> >>> 4393 * it is slow enough). -- FR
> >>> 4394 * If skb is NULL then we come here again once a tx irq is
> >>> 4395 * triggered after the last fragment is marked transmitted.
> >>> 4396 */
> >>> → 4397 if (tp->cur_tx != dirty_tx && skb)
> >>> 4398 rtl8169_doorbell(tp);
> >>> 4399 }
> >>> 4400 }
> >>>
> >>> Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
> >>> line 4363:
> >>>
> >>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
> >>>
> >>> but the same solution is required for protecting the other access to tp->cur_tx:
> >>>
> >>> → 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
> >>> 4398 rtl8169_doorbell(tp);
> >>>
> >>> The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
> >>> might have suffered read tearing under some compiler optimisations.
> >>>
> >>> The fix eliminated the KCSAN data-race report for this bug.
> >>>
> >>> It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
> >>> and line 4397. This test should certainly not be cached by the compiler in some register
> >>> for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
> >>> 4254 in the meantime.
> >>>
> >>
> >> netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
> >> is used in line 4397. I'm not aware of any reported issues with an obvious link to the
> >> potentential issue you describe.
> >> I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
> >> KCSAN happy, why not.
> >
> > Barries don't protect unmarked accesses from being miscompiled. So the
> > use of barriers and marked accesses like READ_ONCE() is correct:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
> > That said, actually encountering a miscompilation depends on
> > architecture and compiler. Using the right marked accesses just
> > ensures things don't suddenly break because the compiler decided to be
> > a little more clever.
> >
> >>> Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
> >>> Cc: Heiner Kallweit <[email protected]>
> >>> Cc: [email protected]
> >>> Cc: "David S. Miller" <[email protected]>
> >>> Cc: Eric Dumazet <[email protected]>
> >>> Cc: Jakub Kicinski <[email protected]>
> >>> Cc: Paolo Abeni <[email protected]>
> >>> Cc: Marco Elver <[email protected]>
> >>> Cc: [email protected]
> >>> Link: https://lore.kernel.org/lkml/[email protected]/
> >>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
> >
> > Acked-by: Marco Elver <[email protected]>
>
> Hi, Marco,
>
> Does this Acked-by: cover all of the [123]/3 in the patch series?

Technically, no. (Only if this was the cover lettter i.e. 0/n)

Send the non-RFC separately and I'll ack them.

2023-10-22 17:24:46

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On 9/27/23 21:52, Heiner Kallweit wrote:
> On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
>> KCSAN reported the following data-race:
>>
>> ==================================================================
>> BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
>>
>> write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
>> rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
>> dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
>> sch_direct_xmit (net/sched/sch_generic.c:342)
>> __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
>> ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
>> __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
>> ip_finish_output (net/ipv4/ip_output.c:328)
>> ip_output (net/ipv4/ip_output.c:435)
>> ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
>> udp_send_skb (net/ipv4/udp.c:963)
>> udp_sendmsg (net/ipv4/udp.c:1246)
>> inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
>> sock_sendmsg (net/socket.c:730 net/socket.c:753)
>> __sys_sendto (net/socket.c:2177)
>> __x64_sys_sendto (net/socket.c:2185)
>> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>>
>> read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
>> rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
>> __napi_poll (net/core/dev.c:6527)
>> net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
>> __do_softirq (kernel/softirq.c:553)
>> __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
>> irq_exit_rcu (kernel/softirq.c:647)
>> common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
>> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
>> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
>> cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
>> call_cpuidle (kernel/sched/idle.c:135)
>> do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
>> cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
>> start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
>> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
>>
>> value changed: 0x002f4815 -> 0x002f4816
>>
>> Reported by Kernel Concurrency Sanitizer on:
>> CPU: 21 PID: 0 Comm: swapper/21 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>> Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> ==================================================================
>>
>> The write side of drivers/net/ethernet/realtek/r8169_main.c is:
>> ==================
>> 4251 /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
>> 4252 smp_wmb();
>> 4253
>> → 4254 WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
>> 4255
>> 4256 stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
>> 4257 R8169_TX_STOP_THRS,
>> 4258 R8169_TX_START_THRS);
>>
>> The read side is the function rtl_tx():
>>
>> 4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>> 4356 int budget)
>> 4357 {
>> 4358 unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
>> 4359 struct sk_buff *skb;
>> 4360
>> 4361 dirty_tx = tp->dirty_tx;
>> 4362
>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>> 4364 unsigned int entry = dirty_tx % NUM_TX_DESC;
>> 4365 u32 status;
>> 4366
>> 4367 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
>> 4368 if (status & DescOwn)
>> 4369 break;
>> 4370
>> 4371 skb = tp->tx_skb[entry].skb;
>> 4372 rtl8169_unmap_tx_skb(tp, entry);
>> 4373
>> 4374 if (skb) {
>> 4375 pkts_compl++;
>> 4376 bytes_compl += skb->len;
>> 4377 napi_consume_skb(skb, budget);
>> 4378 }
>> 4379 dirty_tx++;
>> 4380 }
>> 4381
>> 4382 if (tp->dirty_tx != dirty_tx) {
>> 4383 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
>> 4384 WRITE_ONCE(tp->dirty_tx, dirty_tx);
>> 4385
>> 4386 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
>> 4387 rtl_tx_slots_avail(tp),
>> 4388 R8169_TX_START_THRS);
>> 4389 /*
>> 4390 * 8168 hack: TxPoll requests are lost when the Tx packets are
>> 4391 * too close. Let's kick an extra TxPoll request when a burst
>> 4392 * of start_xmit activity is detected (if it is not detected,
>> 4393 * it is slow enough). -- FR
>> 4394 * If skb is NULL then we come here again once a tx irq is
>> 4395 * triggered after the last fragment is marked transmitted.
>> 4396 */
>> → 4397 if (tp->cur_tx != dirty_tx && skb)
>> 4398 rtl8169_doorbell(tp);
>> 4399 }
>> 4400 }
>>
>> Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
>> line 4363:
>>
>> 4363 while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>>
>> but the same solution is required for protecting the other access to tp->cur_tx:
>>
>> → 4397 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>> 4398 rtl8169_doorbell(tp);
>>
>> The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
>> might have suffered read tearing under some compiler optimisations.
>>
>> The fix eliminated the KCSAN data-race report for this bug.
>>
>> It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
>> and line 4397. This test should certainly not be cached by the compiler in some register
>> for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
>> 4254 in the meantime.

> netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
> is used in line 4397. I'm not aware of any reported issues with an obvious link to the
> potentential issue you describe.
> I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
> KCSAN happy, why not.

Hi, Mr. Kallweit,

Your words have been taken with great weight and serious concern on my behalf, so I took
some time to investigate (even after the patches were Ack-ed).

The question is whether we deal with a data-race or a false positive reported by KCSAN.

I can indeed independently confirm that netif_subqueue_completed_wake() eventually expands to
call to:

include/net/netdev_queues.h:
==============================================
/* Variant of netdev_tx_completed_queue() which guarantees smp_mb() if
* @bytes != 0, regardless of kernel config.
*/
static inline void
netdev_txq_completed_mb(struct netdev_queue *dev_queue,
unsigned int pkts, unsigned int bytes)
{
if (IS_ENABLED(CONFIG_BQL))
netdev_tx_completed_queue(dev_queue, pkts, bytes);
else if (bytes)
smp_mb();
}

Obviously, smp_mb() is here.

On my __x86_64__ it expands to an ugly hack:

#define smp_mb() asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")

Thus far I concur with your findings.

But I see also Mr. Elver's point:

1. With the reference to: [1] https://www.kernel.org/doc/html/v6.2/core-api/wrappers/memory-barriers.html
compiler can do some "smart" optimisations that only work on single core, single threaded use.

What comes to mind is that "struct rtl8169_private" and its members might not be on the same cache line
with the pointer struct rtl8169_private *tp.

By the exposed matter in [1], READ_ONCE() provides a way to explain to the compiler that tp->cur_tx can
be concurrently modified by the other thread/core.

To be honest, I haven't caught a network card error or a driver bug either, but only the KCSAN complaint.

It would seem prudent to protect all the variables with the concurrent accesses from several cores/threads
or devices through DMA with (READ|WRITE)_ONCE.

Especially since I don't see where the callback added in drivers/net/ethernet/realtek/r8169_main.c:5277
is called from:

netif_napi_add(dev, &tp->napi, rtl8169_poll);

I am not trying to be an arbiter nor an authority on the rtl8169 driver, which would obviously be you.

However, I feel compelled by something to review all the accesses to tp->cur_tx, tp->cur_tx and other
structure members against concurrent access, and systematically. With your permission, of course.

I would appreciate your mentoring on the task if you are interested.

Mainly, I figured out that in:

4575 static int rtl8169_poll(struct napi_struct *napi, int budget)
4576 {
4577 struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
4578 struct net_device *dev = tp->dev;
4579 int work_done;
4580
4581 rtl_tx(dev, tp, budget);
4582
4583 work_done = rtl_rx(dev, tp, budget);
4584
4585 if (work_done < budget && napi_complete_done(napi, work_done))
4586 rtl_irq_enable(tp);
4587
4588 return work_done;
4589 }

rtl_tx() and rtl_rx() are called in sequence. Is it possible to imagine doing it in parallel?

Can the Realtek NIC hardware support that at all and which generations can?

(Of course, Ethernet might be full-duplex, but PCI Express line might not be.)

[2] https://www.rambus.com/blogs/pci-express-4/ Full-duplex PCI 4.0 specs.

Thank you.

Best regards,
Mirsad Todorovac

>> Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
>> Cc: Heiner Kallweit <[email protected]>
>> Cc: [email protected]
>> Cc: "David S. Miller" <[email protected]>
>> Cc: Eric Dumazet <[email protected]>
>> Cc: Jakub Kicinski <[email protected]>
>> Cc: Paolo Abeni <[email protected]>
>> Cc: Marco Elver <[email protected]>
>> Cc: [email protected]
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>> ---
>> v1:
>> the initial patch proposal. fixes the KCSAN warning.
>>
>> drivers/net/ethernet/realtek/r8169_main.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index 6351a2dc13bc..281aaa851847 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>> * If skb is NULL then we come here again once a tx irq is
>> * triggered after the last fragment is marked transmitted.
>> */
>> - if (tp->cur_tx != dirty_tx && skb)
>> + if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>> rtl8169_doorbell(tp);
>> }
>> }

2023-10-22 21:33:33

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH RFC v1 1/3] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx

On 22.10.2023 19:24, Mirsad Todorovac wrote:
> On 9/27/23 21:52, Heiner Kallweit wrote:
>> On 27.09.2023 20:41, Mirsad Goran Todorovac wrote:
>>> KCSAN reported the following data-race:
>>>
>>> ==================================================================
>>> BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]
>>>
>>> write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
>>> rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
>>> dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
>>> sch_direct_xmit (net/sched/sch_generic.c:342)
>>> __dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
>>> ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
>>> __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
>>> ip_finish_output (net/ipv4/ip_output.c:328)
>>> ip_output (net/ipv4/ip_output.c:435)
>>> ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
>>> udp_send_skb (net/ipv4/udp.c:963)
>>> udp_sendmsg (net/ipv4/udp.c:1246)
>>> inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
>>> sock_sendmsg (net/socket.c:730 net/socket.c:753)
>>> __sys_sendto (net/socket.c:2177)
>>> __x64_sys_sendto (net/socket.c:2185)
>>> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>>> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>>>
>>> read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
>>> rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
>>> __napi_poll (net/core/dev.c:6527)
>>> net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
>>> __do_softirq (kernel/softirq.c:553)
>>> __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
>>> irq_exit_rcu (kernel/softirq.c:647)
>>> common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
>>> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
>>> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
>>> cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
>>> call_cpuidle (kernel/sched/idle.c:135)
>>> do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
>>> cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
>>> start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
>>> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
>>>
>>> value changed: 0x002f4815 -> 0x002f4816
>>>
>>> Reported by Kernel Concurrency Sanitizer on:
>>> CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>>> Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>> ==================================================================
>>>
>>> The write side of drivers/net/ethernet/realtek/r8169_main.c is:
>>> ==================
>>>     4251         /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
>>>     4252         smp_wmb();
>>>     4253
>>>   → 4254         WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
>>>     4255
>>>     4256         stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
>>>     4257                                                 R8169_TX_STOP_THRS,
>>>     4258                                                 R8169_TX_START_THRS);
>>>
>>> The read side is the function rtl_tx():
>>>
>>>     4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>>>     4356                    int budget)
>>>     4357 {
>>>     4358         unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
>>>     4359         struct sk_buff *skb;
>>>     4360
>>>     4361         dirty_tx = tp->dirty_tx;
>>>     4362
>>>     4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>>>     4364                 unsigned int entry = dirty_tx % NUM_TX_DESC;
>>>     4365                 u32 status;
>>>     4366
>>>     4367                 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
>>>     4368                 if (status & DescOwn)
>>>     4369                         break;
>>>     4370
>>>     4371                 skb = tp->tx_skb[entry].skb;
>>>     4372                 rtl8169_unmap_tx_skb(tp, entry);
>>>     4373
>>>     4374                 if (skb) {
>>>     4375                         pkts_compl++;
>>>     4376                         bytes_compl += skb->len;
>>>     4377                         napi_consume_skb(skb, budget);
>>>     4378                 }
>>>     4379                 dirty_tx++;
>>>     4380         }
>>>     4381
>>>     4382         if (tp->dirty_tx != dirty_tx) {
>>>     4383                 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
>>>     4384                 WRITE_ONCE(tp->dirty_tx, dirty_tx);
>>>     4385
>>>     4386                 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
>>>     4387                                               rtl_tx_slots_avail(tp),
>>>     4388                                               R8169_TX_START_THRS);
>>>     4389                 /*
>>>     4390                  * 8168 hack: TxPoll requests are lost when the Tx packets are
>>>     4391                  * too close. Let's kick an extra TxPoll request when a burst
>>>     4392                  * of start_xmit activity is detected (if it is not detected,
>>>     4393                  * it is slow enough). -- FR
>>>     4394                  * If skb is NULL then we come here again once a tx irq is
>>>     4395                  * triggered after the last fragment is marked transmitted.
>>>     4396                  */
>>>   → 4397                 if (tp->cur_tx != dirty_tx && skb)
>>>     4398                         rtl8169_doorbell(tp);
>>>     4399         }
>>>     4400 }
>>>
>>> Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
>>> line 4363:
>>>
>>>     4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
>>>
>>> but the same solution is required for protecting the other access to tp->cur_tx:
>>>
>>>   → 4397                 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>>>     4398                         rtl8169_doorbell(tp);
>>>
>>> The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
>>> might have suffered read tearing under some compiler optimisations.
>>>
>>> The fix eliminated the KCSAN data-race report for this bug.
>>>
>>> It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
>>> and line 4397. This test should certainly not be cached by the compiler in some register
>>> for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
>>> 4254 in the meantime.
>
>> netif_subqueue_completed_wake() has barriers ensuring that no cached value for tp->cur_tx
>> is used in line 4397. I'm not aware of any reported issues with an obvious link to the
>> potentential issue you describe.
>> I don't have a strong opinion on these patches. They shouldn't hurt, and if they make
>> KCSAN happy, why not.
>
> Hi, Mr. Kallweit,
>
> Your words have been taken with great weight and serious concern on my behalf, so I took
> some time to investigate (even after the patches were Ack-ed).
>
No need to be so formal. I'm just a hobbyist developer who once suddenly
found himself in a maintainer position ;)

> The question is whether we deal with a data-race or a false positive reported by KCSAN.
>
> I can indeed independently confirm that netif_subqueue_completed_wake() eventually expands to
> call to:
>
> include/net/netdev_queues.h:
> ==============================================
> /* Variant of netdev_tx_completed_queue() which guarantees smp_mb() if
>  * @bytes != 0, regardless of kernel config.
>  */
> static inline void
> netdev_txq_completed_mb(struct netdev_queue *dev_queue,
>             unsigned int pkts, unsigned int bytes)
> {
>     if (IS_ENABLED(CONFIG_BQL))
>         netdev_tx_completed_queue(dev_queue, pkts, bytes);
>     else if (bytes)
>         smp_mb();
> }
>
> Obviously, smp_mb() is here.
>
> On my __x86_64__ it expands to an ugly hack:
>
>         #define smp_mb()  asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
>
> Thus far I concur with your findings.
>
> But I see also Mr. Elver's point:
>
> 1. With the reference to: [1] https://www.kernel.org/doc/html/v6.2/core-api/wrappers/memory-barriers.html
> compiler can do some "smart" optimisations that only work on single core, single threaded use.
>
> What comes to mind is that "struct rtl8169_private" and its members might not be on the same cache line
> with the pointer struct rtl8169_private *tp.
>
> By the exposed matter in [1], READ_ONCE() provides a way to explain to the compiler that tp->cur_tx can
> be concurrently modified by the other thread/core.
>
> To be honest, I haven't caught a network card error or a driver bug either, but only the KCSAN complaint.
>
> It would seem prudent to protect all the variables with the concurrent accesses from several cores/threads
> or devices through DMA with (READ|WRITE)_ONCE.
>
r8169 driver supports one queue only. NAPI softirq guarantees that calls to rtl8169_poll
are serialized. Threaded NAPI may be a different story, however I'm not aware of any bug
report related to threaded NAPI.

> Especially since I don't see where the callback added in drivers/net/ethernet/realtek/r8169_main.c:5277
> is called from:
>
>          netif_napi_add(dev, &tp->napi, rtl8169_poll);
>
> I am not trying to be an arbiter nor an authority on the rtl8169 driver, which would obviously be you.
>
> However, I feel compelled by something to review all the accesses to tp->cur_tx, tp->cur_tx and other
> structure members against concurrent access, and systematically. With your permission, of course.
>
> I would appreciate your mentoring on the task if you are interested.
>
> Mainly, I figured out that in:
>
> 4575 static int rtl8169_poll(struct napi_struct *napi, int budget)
> 4576 {
> 4577         struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
> 4578         struct net_device *dev = tp->dev;
> 4579         int work_done;
> 4580
> 4581         rtl_tx(dev, tp, budget);
> 4582
> 4583         work_done = rtl_rx(dev, tp, budget);
> 4584
> 4585         if (work_done < budget && napi_complete_done(napi, work_done))
> 4586                 rtl_irq_enable(tp);
> 4587
> 4588         return work_done;
> 4589 }
>
> rtl_tx() and rtl_rx() are called in sequence. Is it possible to imagine doing it in parallel?
>
Adding a separated TX NAPI is possible, question is whether it's worth the
effort. Look for users of netif_napi_add_tx() to get an idea of separated TX NAPI.
I once experimented with it in context of r8169 and didn't really find a benefit.

> Can the Realtek NIC hardware support that at all and which generations can?
>
Well, major issue with Realtek HW is that there's no public datasheets.
Only source of information is their out-of-tree drivers (r8168, r8101, r8125).

With regard to RSS: r8125 supports it, r8168 not.

> (Of course, Ethernet might be full-duplex, but PCI Express line might not be.)
>
> [2] https://www.rambus.com/blogs/pci-express-4/ Full-duplex PCI 4.0 specs.
>
> Thank you.
>
> Best regards,
> Mirsad Todorovac
>
>>> Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
>>> Cc: Heiner Kallweit <[email protected]>
>>> Cc: [email protected]
>>> Cc: "David S. Miller" <[email protected]>
>>> Cc: Eric Dumazet <[email protected]>
>>> Cc: Jakub Kicinski <[email protected]>
>>> Cc: Paolo Abeni <[email protected]>
>>> Cc: Marco Elver <[email protected]>
>>> Cc: [email protected]
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Signed-off-by: Mirsad Goran Todorovac <[email protected]>
>>> ---
>>> v1:
>>>   the initial patch proposal. fixes the KCSAN warning.
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 6351a2dc13bc..281aaa851847 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>>>            * If skb is NULL then we come here again once a tx irq is
>>>            * triggered after the last fragment is marked transmitted.
>>>            */
>>> -        if (tp->cur_tx != dirty_tx && skb)
>>> +        if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
>>>               rtl8169_doorbell(tp);
>>>       }
>>>   }