2024-06-10 15:45:15

by Larysa Zaremba

Subject: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
including both pool and program operations.

PF reset can be triggered asynchronously, e.g. by tx_timeout. With some
unfortunate timing, both reset and .ndo_bpf() will try to access and modify
XDP rings at the same time, causing a system crash such as the one below:

[ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14
[ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18
[Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms
[ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001
[ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14
[ +0.394718] ice 0000:b1:00.0: PTP reset successful
[ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ +0.000045] #PF: supervisor read access in kernel mode
[ +0.000023] #PF: error_code(0x0000) - not-present page
[ +0.000023] PGD 0 P4D 0
[ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1
[ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ +0.000036] Workqueue: ice ice_service_task [ice]
[ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice]
[...]
[ +0.000013] Call Trace:
[ +0.000016] <TASK>
[ +0.000014] ? __die+0x1f/0x70
[ +0.000029] ? page_fault_oops+0x171/0x4f0
[ +0.000029] ? schedule+0x3b/0xd0
[ +0.000027] ? exc_page_fault+0x7b/0x180
[ +0.000022] ? asm_exc_page_fault+0x22/0x30
[ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice]
[ +0.000194] ice_free_tx_ring+0xe/0x60 [ice]
[ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice]
[ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice]
[ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice]
[ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice]
[ +0.000145] ice_rebuild+0x18c/0x840 [ice]
[ +0.000145] ? delay_tsc+0x4a/0xc0
[ +0.000022] ? delay_tsc+0x92/0xc0
[ +0.000020] ice_do_reset+0x140/0x180 [ice]
[ +0.000886] ice_service_task+0x404/0x1030 [ice]
[ +0.000824] process_one_work+0x171/0x340
[ +0.000685] worker_thread+0x277/0x3a0
[ +0.000675] ? preempt_count_add+0x6a/0xa0
[ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50
[ +0.000679] ? __pfx_worker_thread+0x10/0x10
[ +0.000653] kthread+0xf0/0x120
[ +0.000635] ? __pfx_kthread+0x10/0x10
[ +0.000616] ret_from_fork+0x2d/0x50
[ +0.000612] ? __pfx_kthread+0x10/0x10
[ +0.000604] ret_from_fork_asm+0x1b/0x30
[ +0.000604] </TASK>

Larysa Zaremba (3):
ice: synchronize XDP setup with reset
ice: fix locking in ice_xsk_pool_setup()
ice: make NAPI setting code aware that rtnl-locked request is waiting

drivers/net/ethernet/intel/ice/ice.h | 2 ++
drivers/net/ethernet/intel/ice/ice_lib.c | 23 ++++++++++---
drivers/net/ethernet/intel/ice/ice_main.c | 39 ++++++++++++++++++++---
drivers/net/ethernet/intel/ice/ice_xsk.c | 12 ++-----
4 files changed, 57 insertions(+), 19 deletions(-)

--
2.43.0



2024-06-10 15:45:40

by Larysa Zaremba

Subject: [PATCH iwl-net 1/3] ice: synchronize XDP setup with reset

XDP setup and PF reset code access the same resources in the following
sections:
* ice_vsi_close() in ice_prepare_for_reset() - already rtnl-locked
* ice_vsi_rebuild() for the PF VSI - not protected

With unfortunate timing, such accesses can result in a crash such as the
one below:

[ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14
[ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18
[Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms
[ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001
[ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14
[ +0.394718] ice 0000:b1:00.0: PTP reset successful
[ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ +0.000045] #PF: supervisor read access in kernel mode
[ +0.000023] #PF: error_code(0x0000) - not-present page
[ +0.000023] PGD 0 P4D 0
[ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1
[ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ +0.000036] Workqueue: ice ice_service_task [ice]
[ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice]
[...]
[ +0.000013] Call Trace:
[ +0.000016] <TASK>
[ +0.000014] ? __die+0x1f/0x70
[ +0.000029] ? page_fault_oops+0x171/0x4f0
[ +0.000029] ? schedule+0x3b/0xd0
[ +0.000027] ? exc_page_fault+0x7b/0x180
[ +0.000022] ? asm_exc_page_fault+0x22/0x30
[ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice]
[ +0.000194] ice_free_tx_ring+0xe/0x60 [ice]
[ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice]
[ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice]
[ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice]
[ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice]
[ +0.000145] ice_rebuild+0x18c/0x840 [ice]
[ +0.000145] ? delay_tsc+0x4a/0xc0
[ +0.000022] ? delay_tsc+0x92/0xc0
[ +0.000020] ice_do_reset+0x140/0x180 [ice]
[ +0.000886] ice_service_task+0x404/0x1030 [ice]
[ +0.000824] process_one_work+0x171/0x340
[ +0.000685] worker_thread+0x277/0x3a0
[ +0.000675] ? preempt_count_add+0x6a/0xa0
[ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50
[ +0.000679] ? __pfx_worker_thread+0x10/0x10
[ +0.000653] kthread+0xf0/0x120
[ +0.000635] ? __pfx_kthread+0x10/0x10
[ +0.000616] ret_from_fork+0x2d/0x50
[ +0.000612] ? __pfx_kthread+0x10/0x10
[ +0.000604] ret_from_fork_asm+0x1b/0x30
[ +0.000604] </TASK>

The previous way of handling this, returning -EBUSY, is not viable,
particularly when destroying an AF_XDP socket, because the kernel proceeds
with the removal anyway.

There is plenty of code between those calls and no need to create a large
critical section that covers both of them, just as there is no need to
protect ice_vsi_rebuild() with rtnl_lock().

Leaving an unprotected section in between results in a state where the VSI
is closed but not yet rebuilt; such a situation can be handled pretty
easily.

Lock ice_vsi_rebuild() with the ICE_CFG_BUSY flag, in particular to prevent
a system crash when tx_timeout and .ndo_bpf() happen at the same time.
Also, handle the state between the critical sections by skipping XDP ring
configuration.

Fixes: efc2214b6047 ("ice: Add support for XDP")
Reviewed-by: Igor Bagnucki <[email protected]>
Signed-off-by: Larysa Zaremba <[email protected]>
---
drivers/net/ethernet/intel/ice/ice_lib.c | 5 +++-
drivers/net/ethernet/intel/ice/ice_main.c | 36 +++++++++++++++++++----
2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 7629b0190578..4774bcc4d5a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2426,7 +2426,10 @@ void ice_vsi_decfg(struct ice_vsi *vsi)
dev_err(ice_pf_to_dev(pf), "Failed to remove RDMA scheduler config for VSI %u, err %d\n",
vsi->vsi_num, err);

- if (ice_is_xdp_ena_vsi(vsi))
+ /* xdp_rings can be absent, if program was attached amid reset,
+ * VSI rebuild is supposed to create them later
+ */
+ if (ice_is_xdp_ena_vsi(vsi) && vsi->xdp_rings)
/* return value check can be skipped here, it always returns
* 0 if reset is in progress
*/
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 15a6805ac2a1..dc60d816a345 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2986,6 +2986,20 @@ static int ice_max_xdp_frame_size(struct ice_vsi *vsi)
return ICE_RXBUF_3072;
}

+/**
+ * ice_rebuild_pending - ice_vsi_rebuild will be performed, when locks are released
+ * @vsi: VSI to setup XDP for
+ *
+ * ice_vsi_close() in the reset path is called under rtnl_lock(),
+ * so it happens strictly before or after .ndo_bpf().
+ * In case it has happened before, we do not have anything attached to rings
+ */
+static bool ice_rebuild_pending(struct ice_vsi *vsi)
+{
+ return ice_is_reset_in_progress(vsi->back->state) &&
+ !vsi->rx_rings[0]->desc;
+}
+
/**
* ice_xdp_setup_prog - Add or remove XDP eBPF program
* @vsi: VSI to setup XDP for
@@ -3009,7 +3023,7 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
}

/* hot swap progs and avoid toggling link */
- if (ice_is_xdp_ena_vsi(vsi) == !!prog) {
+ if (ice_is_xdp_ena_vsi(vsi) == !!prog || ice_rebuild_pending(vsi)) {
ice_vsi_assign_bpf_prog(vsi, prog);
return 0;
}
@@ -3081,21 +3095,30 @@ static int ice_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
struct ice_netdev_priv *np = netdev_priv(dev);
struct ice_vsi *vsi = np->vsi;
+ struct ice_pf *pf = vsi->back;
+ int ret;

if (vsi->type != ICE_VSI_PF) {
NL_SET_ERR_MSG_MOD(xdp->extack, "XDP can be loaded only on PF VSI");
return -EINVAL;
}

+ while (test_and_set_bit(ICE_CFG_BUSY, pf->state))
+ usleep_range(1000, 2000);
+
switch (xdp->command) {
case XDP_SETUP_PROG:
- return ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack);
+ ret = ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack);
+ break;
case XDP_SETUP_XSK_POOL:
- return ice_xsk_pool_setup(vsi, xdp->xsk.pool,
- xdp->xsk.queue_id);
+ ret = ice_xsk_pool_setup(vsi, xdp->xsk.pool, xdp->xsk.queue_id);
+ break;
default:
- return -EINVAL;
+ ret = -EINVAL;
}
+
+ clear_bit(ICE_CFG_BUSY, pf->state);
+ return ret;
}

/**
@@ -7672,7 +7695,10 @@ static void ice_rebuild(struct ice_pf *pf, enum ice_reset_req reset_type)
ice_gnss_init(pf);

/* rebuild PF VSI */
+ while (test_and_set_bit(ICE_CFG_BUSY, pf->state))
+ usleep_range(1000, 2000);
err = ice_vsi_rebuild_by_type(pf, ICE_VSI_PF);
+ clear_bit(ICE_CFG_BUSY, pf->state);
if (err) {
dev_err(dev, "PF VSI rebuild failed: %d\n", err);
goto err_vsi_rebuild;
--
2.43.0


2024-06-10 15:46:07

by Larysa Zaremba

Subject: [PATCH iwl-net 2/3] ice: fix locking in ice_xsk_pool_setup()

With the ICE_CFG_BUSY PF state flag now taken in ice_xdp(), there is no
need to lock on the VSI state inside ice_xsk_pool_setup(). For robust
synchronization, the state between reset preparation and PF VSI rebuild has
to be handled in the same way as in ice_xdp_setup_prog().

Remove the locking logic from ice_qp_dis() and ice_qp_ena() and skip those
functions if a rebuild is pending.

Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Igor Bagnucki <[email protected]>
Signed-off-by: Larysa Zaremba <[email protected]>
---
drivers/net/ethernet/intel/ice/ice.h | 1 +
drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
drivers/net/ethernet/intel/ice/ice_xsk.c | 12 ++----------
3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 701a61d791dd..76590cfcaf68 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -941,6 +941,7 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog,
enum ice_xdp_cfg cfg_type);
int ice_destroy_xdp_rings(struct ice_vsi *vsi, enum ice_xdp_cfg cfg_type);
void ice_map_xdp_rings(struct ice_vsi *vsi);
+bool ice_rebuild_pending(struct ice_vsi *vsi);
int
ice_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
u32 flags);
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index dc60d816a345..cd8be3c3b956 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2994,7 +2994,7 @@ static int ice_max_xdp_frame_size(struct ice_vsi *vsi)
* so it happens strictly before or after .ndo_bpf().
* In case it has happened before, we do not have anything attached to rings
*/
-static bool ice_rebuild_pending(struct ice_vsi *vsi)
+bool ice_rebuild_pending(struct ice_vsi *vsi)
{
return ice_is_reset_in_progress(vsi->back->state) &&
!vsi->rx_rings[0]->desc;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 4e2020ab0825..6c95bebd7777 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -163,7 +163,6 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
struct ice_tx_ring *xdp_ring;
struct ice_tx_ring *tx_ring;
struct ice_rx_ring *rx_ring;
- int timeout = 50;
int fail = 0;
int err;

@@ -175,13 +174,6 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
xdp_ring = vsi->xdp_rings[q_idx];
q_vector = rx_ring->q_vector;

- while (test_and_set_bit(ICE_CFG_BUSY, vsi->state)) {
- timeout--;
- if (!timeout)
- return -EBUSY;
- usleep_range(1000, 2000);
- }
-
synchronize_net();
netif_trans_update(vsi->netdev);
netif_carrier_off(vsi->netdev);
@@ -251,7 +243,6 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
synchronize_net();
netif_tx_start_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
netif_carrier_on(vsi->netdev);
- clear_bit(ICE_CFG_BUSY, vsi->state);

return fail;
}
@@ -379,7 +370,8 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
return -EINVAL;
}

- if_running = netif_running(vsi->netdev) && ice_is_xdp_ena_vsi(vsi);
+ if_running = !ice_rebuild_pending(vsi) &&
+ netif_running(vsi->netdev) && ice_is_xdp_ena_vsi(vsi);

if (if_running) {
struct ice_rx_ring *rx_ring = vsi->rx_rings[qid];
--
2.43.0


2024-06-10 16:25:34

by Larysa Zaremba

Subject: [PATCH iwl-net 3/3] ice: make NAPI setting code aware that rtnl-locked request is waiting

ice_vsi_rebuild() is a critical section in the reset path. Sometimes,
rtnl-locked callbacks have to wait for it to finish. However, they can end
up waiting forever, because the critical section contains calls to
ice_queue_set_napi() that try to take rtnl_lock() themselves.

Make ice_queue_set_napi() aware that some rtnl-locked code is waiting and
skip taking the lock in that case.

Fixes: 080b0c8d6d26 ("ice: Fix ASSERT_RTNL() warning during certain scenarios")
Reviewed-by: Igor Bagnucki <[email protected]>
Signed-off-by: Larysa Zaremba <[email protected]>
---
drivers/net/ethernet/intel/ice/ice.h | 1 +
drivers/net/ethernet/intel/ice/ice_lib.c | 18 +++++++++++++++---
drivers/net/ethernet/intel/ice/ice_main.c | 5 ++++-
3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 76590cfcaf68..7c1e24afa34b 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -307,6 +307,7 @@ enum ice_pf_state {
ICE_PHY_INIT_COMPLETE,
ICE_FD_VF_FLUSH_CTX, /* set at FD Rx IRQ or timeout */
ICE_AUX_ERR_PENDING,
+ ICE_RTNL_WAITS_FOR_RESET,
ICE_STATE_NBITS /* must be last */
};

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 4774bcc4d5a8..a5dc6fc6e63d 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2740,12 +2740,24 @@ ice_queue_set_napi(struct ice_vsi *vsi, unsigned int queue_index,
if (current_work() == &pf->serv_task ||
test_bit(ICE_PREPARED_FOR_RESET, pf->state) ||
test_bit(ICE_DOWN, pf->state) ||
- test_bit(ICE_SUSPENDED, pf->state))
+ test_bit(ICE_SUSPENDED, pf->state)) {
+ bool rtnl_held_here = true;
+
+ while (!rtnl_trylock()) {
+ if (test_bit(ICE_RTNL_WAITS_FOR_RESET, pf->state)) {
+ rtnl_held_here = false;
+ break;
+ }
+ usleep_range(1000, 2000);
+ }
__ice_queue_set_napi(vsi->netdev, queue_index, type, napi,
- false);
- else
+ true);
+ if (rtnl_held_here)
+ rtnl_unlock();
+ } else {
__ice_queue_set_napi(vsi->netdev, queue_index, type, napi,
true);
+ }
}

/**
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index cd8be3c3b956..37b3dd22cdc2 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3103,8 +3103,11 @@ static int ice_xdp(struct net_device *dev, struct netdev_bpf *xdp)
return -EINVAL;
}

- while (test_and_set_bit(ICE_CFG_BUSY, pf->state))
+ while (test_and_set_bit(ICE_CFG_BUSY, pf->state)) {
+ set_bit(ICE_RTNL_WAITS_FOR_RESET, pf->state);
usleep_range(1000, 2000);
+ }
+ clear_bit(ICE_RTNL_WAITS_FOR_RESET, pf->state);

switch (xdp->command) {
case XDP_SETUP_PROG:
--
2.43.0


2024-06-12 02:41:29

by Jakub Kicinski

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Mon, 10 Jun 2024 17:37:12 +0200 Larysa Zaremba wrote:
> Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
> including both pool and program operations.

Is there really no way for ice to fix the locking? :(
The busy loops and trylocks() are not great, and seem like duct tape.

2024-06-12 06:58:19

by Larysa Zaremba

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Tue, Jun 11, 2024 at 07:38:37PM -0700, Jakub Kicinski wrote:
> On Mon, 10 Jun 2024 17:37:12 +0200 Larysa Zaremba wrote:
> > Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
> > including both pool and program operations.
>
> Is there really no way for ice to fix the locking? :(
> The busy loops and trylocks() are not great, and seem like duct tape.
>

The locking mechanisms I use here do not look pretty, but if I am not missing
anything, the synchronization they provide must be robust.

A prettier way of protecting the same critical sections would be replacing
ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
to stay.

At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
think this is a goal worth pursuing?
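
A minimal sketch of that rtnl_lock() alternative, reusing the ice_rebuild()
call site from patch 1 (illustration only, not code from the posted series;
err, pf and dev are the local variables already present in ice_rebuild()):

	/* rebuild the PF VSI under rtnl_lock() instead of ICE_CFG_BUSY;
	 * .ndo_bpf() already runs under rtnl, so it is serialized
	 * implicitly, while ice_rebuild_pending() still covers the window
	 * between reset preparation and the rebuild
	 */
	rtnl_lock();
	err = ice_vsi_rebuild_by_type(pf, ICE_VSI_PF);
	rtnl_unlock();
	if (err) {
		dev_err(dev, "PF VSI rebuild failed: %d\n", err);
		goto err_vsi_rebuild;
	}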

2024-06-12 21:09:54

by Jakub Kicinski

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Wed, 12 Jun 2024 08:56:38 +0200 Larysa Zaremba wrote:
> On Tue, Jun 11, 2024 at 07:38:37PM -0700, Jakub Kicinski wrote:
> > On Mon, 10 Jun 2024 17:37:12 +0200 Larysa Zaremba wrote:
> > > Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
> > > including both pool and program operations.
> >
> > Is there really no way for ice to fix the locking? :(
> > The busy loops and trylocks() are not great, and seem like duct tape.
>
> The locking mechanisms I use here do not look pretty, but if I am not missing
> anything, the synchronization they provide must be robust.

Robust as in they may be correct here, but you lose lockdep and all
other infra normal mutex would give you.

> A prettier way of protecting the same critical sections would be replacing
> ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
> locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
> to stay.
>
> At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
> think this is a goal worth pursuing?

Is the reset for failure recovery, rather than reconfiguration?
If so netif_device_detach() is generally the best way of avoiding
getting called (I think I mentioned it to someone @intal recently).

2024-06-13 08:55:52

by Larysa Zaremba

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Wed, Jun 12, 2024 at 02:09:35PM -0700, Jakub Kicinski wrote:
> On Wed, 12 Jun 2024 08:56:38 +0200 Larysa Zaremba wrote:
> > On Tue, Jun 11, 2024 at 07:38:37PM -0700, Jakub Kicinski wrote:
> > > On Mon, 10 Jun 2024 17:37:12 +0200 Larysa Zaremba wrote:
> > > > Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
> > > > including both pool and program operations.
> > >
> > > Is there really no way for ice to fix the locking? :(
> > > The busy loops and trylocks() are not great, and seem like duct tape.
> >
> > The locking mechanisms I use here do not look pretty, but if I am not missing
> > anything, the synchronization they provide must be robust.
>
> Robust as in they may be correct here, but you lose lockdep and all
> other infra normal mutex would give you.
>

I know, but __netif_queue_set_napi() requires rtnl_lock() inside the potential
critical section and creates a deadlock this way. However, after reading
patches that introduce this function, I think it is called too early in the
configuration. Seems like it should be called somewhere right after
netif_set_real_num_rx/_tx_queues(), much later in the configuration where we
already hold the rtnl_lock(). In such way, ice_vsi_rebuild() could be protected
with an internal mutex. WDYT?
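
A rough sketch of that direction (hypothetical and not part of this series:
it assumes netif_queue_set_napi() has already been moved out of the rebuild
path and that a driver-internal mutex, here called pf->vsi_cfg_lock, has
been added):

	/* serialize the PF VSI rebuild against .ndo_bpf() with a normal
	 * mutex; the same mutex would then be taken in ice_xdp() instead
	 * of busy-waiting on ICE_CFG_BUSY
	 */
	static int ice_rebuild_pf_vsi(struct ice_pf *pf)
	{
		int err;

		mutex_lock(&pf->vsi_cfg_lock);
		err = ice_vsi_rebuild_by_type(pf, ICE_VSI_PF);
		mutex_unlock(&pf->vsi_cfg_lock);

		return err;
	}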

> > A prettier way of protecting the same critical sections would be replacing
> > ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
> > locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
> > to stay.
> >
> > At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
> > think this is a goal worth pursuing?
>
> Is the reset for failure recovery, rather than reconfiguration?
> If so netif_device_detach() is generally the best way of avoiding
> getting called (I think I mentioned it to someone @intal recently).

AFAIK, netif_device_detach() does not affect .ndo_bpf() calls. We were trying
such approach with idpf and it does work for ethtool, but not for XDP.

2024-06-13 10:45:52

by Przemek Kitszel

Subject: Re: [Intel-wired-lan] [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On 6/13/24 10:54, Larysa Zaremba wrote:
> On Wed, Jun 12, 2024 at 02:09:35PM -0700, Jakub Kicinski wrote:
>> On Wed, 12 Jun 2024 08:56:38 +0200 Larysa Zaremba wrote:
>>> On Tue, Jun 11, 2024 at 07:38:37PM -0700, Jakub Kicinski wrote:
>>>> On Mon, 10 Jun 2024 17:37:12 +0200 Larysa Zaremba wrote:
>>>>> Fix the problems that are triggered by tx_timeout and ice_xdp() calls,
>>>>> including both pool and program operations.
>>>>
>>>> Is there really no way for ice to fix the locking? :(
>>>> The busy loops and trylocks() are not great, and seem like duct tape.
>>>
>>> The locking mechanisms I use here do not look pretty, but if I am not missing
>>> anything, the synchronization they provide must be robust.
>>
>> Robust as in they may be correct here, but you lose lockdep and all
>> other infra normal mutex would give you.
>>
>
> I know, but __netif_queue_set_napi() requires rtnl_lock() inside the potential
> critical section and creates a deadlock this way. However, after reading
> patches that introduce this function, I think it is called too early in the
> configuration. Seems like it should be called somewhere right after
> netif_set_real_num_rx/_tx_queues(), much later in the configuration where we
> already hold the rtnl_lock(). In such way, ice_vsi_rebuild() could be protected
> with an internal mutex. WDYT?
>
>>> A prettier way of protecting the same critical sections would be replacing
>>> ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
>>> locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
>>> to stay.
>>>
>>> At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
>>> think this is a goal worth pursuing?
>>
>> Is the reset for failure recovery, rather than reconfiguration?
>> If so netif_device_detach() is generally the best way of avoiding
>> getting called (I think I mentioned it to someone @intal recently).

for the reference, it was to Dawid here:
https://lore.kernel.org/netdev/[email protected]/

>
> AFAIK, netif_device_detach() does not affect .ndo_bpf() calls. We were trying
> such approach with idpf and it does work for ethtool, but not for XDP.


2024-06-13 14:14:20

by Jakub Kicinski

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Thu, 13 Jun 2024 10:54:12 +0200 Larysa Zaremba wrote:
> > > The locking mechanisms I use here do not look pretty, but if I am not missing
> > > anything, the synchronization they provide must be robust.
> >
> > Robust as in they may be correct here, but you lose lockdep and all
> > other infra normal mutex would give you.
>
> I know, but __netif_queue_set_napi() requires rtnl_lock() inside the potential
> critical section and creates a deadlock this way. However, after reading
> patches that introduce this function, I think it is called too early in the
> configuration. Seems like it should be called somewhere right after
> netif_set_real_num_rx/_tx_queues(), much later in the configuration where we
> already hold the rtnl_lock(). In such way, ice_vsi_rebuild() could be protected
> with an internal mutex. WDYT?

On a quick look I think that may work. For setting the NAPI it makes
sense - netif_set_real_num_rx/_tx_queues() and netif_queue_set_napi()
both inform netdev about the queue config, so its logical to keep them
together. I was worried there may be an inconveniently placed
netif_queue_set_napi() call which is clearing the NAPI pointer.
But I don't see one.

> > > A prettier way of protecting the same critical sections would be replacing
> > > ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
> > > locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
> > > to stay.
> > >
> > > At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
> > > think this is a goal worth pursuing?
> >
> > Is the reset for failure recovery, rather than reconfiguration?
> > If so netif_device_detach() is generally the best way of avoiding
> > getting called (I think I mentioned it to someone @intal recently).
>
> AFAIK, netif_device_detach() does not affect .ndo_bpf() calls. We were trying
> such approach with idpf and it does work for ethtool, but not for XDP.

I reckon that's an unintentional omission. In theory XDP is "pure
software" but if the device is running driver will likely have to
touch HW to reconfigure. So, if you're willing, do send a ndo_bpf
patch to add a detached check.
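
Roughly, the suggested detached check would sit at the top of the driver's
.ndo_bpf() implementation, along these lines (illustrative only; the
follow-up below explains why a plain rejection is not enough for
program/pool removal):

	/* bail out early when the device has been detached for recovery */
	if (!netif_device_present(dev))
		return -ENODEV;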

2024-06-13 15:36:45

by Larysa Zaremba

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Thu, Jun 13, 2024 at 07:13:43AM -0700, Jakub Kicinski wrote:
> On Thu, 13 Jun 2024 10:54:12 +0200 Larysa Zaremba wrote:
> > > > The locking mechanisms I use here do not look pretty, but if I am not missing
> > > > anything, the synchronization they provide must be robust.
> > >
> > > Robust as in they may be correct here, but you lose lockdep and all
> > > other infra normal mutex would give you.
> >
> > I know, but __netif_queue_set_napi() requires rtnl_lock() inside the potential
> > critical section and creates a deadlock this way. However, after reading
> > patches that introduce this function, I think it is called too early in the
> > configuration. Seems like it should be called somewhere right after
> > netif_set_real_num_rx/_tx_queues(), much later in the configuration where we
> > already hold the rtnl_lock(). In such way, ice_vsi_rebuild() could be protected
> > with an internal mutex. WDYT?
>
> On a quick look I think that may work. For setting the NAPI it makes
> sense - netif_set_real_num_rx/_tx_queues() and netif_queue_set_napi()
> both inform netdev about the queue config, so its logical to keep them
> together. I was worried there may be an inconveniently placed
> netif_queue_set_napi() call which is clearing the NAPI pointer.
> But I don't see one.
>

Ok, will do this in v2. Thanks for the discussion.

> > > > A prettier way of protecting the same critical sections would be replacing
> > > > ICE_CFG_BUSY around ice_vsi_rebuild() with rtnl_lock(), this would eliminate
> > > > locking code from .ndo_bpf() altogether, ice_rebuild_pending() logic will have
> > > > to stay.
> > > >
> > > > At some point I have decided to avoid using rtnl_lock(), if I do not have to. I
> > > > think this is a goal worth pursuing?
> > >
> > > Is the reset for failure recovery, rather than reconfiguration?
> > > If so netif_device_detach() is generally the best way of avoiding
> > > getting called (I think I mentioned it to someone @intal recently).
> >
> > AFAIK, netif_device_detach() does not affect .ndo_bpf() calls. We were trying
> > such approach with idpf and it does work for ethtool, but not for XDP.
>
> I reckon that's an unintentional omission. In theory XDP is "pure
> software" but if the device is running driver will likely have to
> touch HW to reconfigure. So, if you're willing, do send a ndo_bpf
> patch to add a detached check.

This does not seem that simple. In cases of program/pool detachment,
.ndo_bpf() does not accept 'no' as an answer, so there is no easy existing way
of handling !netif_device_present() either. And we have to notify the driver
that pool/program is no longer needed no matter what. So what is left is somehow
postpone pool/prog removal to after the netdev gets attached again.

2024-06-13 15:47:51

by Jakub Kicinski

Subject: Re: [PATCH iwl-net 0/3] ice: fix synchronization between .ndo_bpf() and reset

On Thu, 13 Jun 2024 17:36:16 +0200 Larysa Zaremba wrote:
> > > AFAIK, netif_device_detach() does not affect .ndo_bpf() calls. We were trying
> > > such approach with idpf and it does work for ethtool, but not for XDP.
> >
> > I reckon that's an unintentional omission. In theory XDP is "pure
> > software" but if the device is running driver will likely have to
> > touch HW to reconfigure. So, if you're willing, do send a ndo_bpf
> > patch to add a detached check.
>
> This does not seem that simple. In cases of program/pool detachment,
> .ndo_bpf() does not accept 'no' as an answer, so there is no easy existing way
> of handling !netif_device_present() either. And we have to notify the driver
> that pool/program is no longer needed no matter what. So what is left is somehow
> postpone pool/prog removal to after the netdev gets attached again.

I see, thanks for investigating!