2023-04-11 12:11:22

by Daniel Wagner

Subject: [PATCH v2 0/4] nvmet-fcloop: unblock module removal

blktests is not able to unload the FC related modules. With these patches it is
possible to unload the modules, but the setup still does not work correctly:
the host and the controller seem to end up in a kind of live deadlock:

loop: module loaded
run blktests nvme/003 at 2023-04-11 13:55:57
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:0} Association created
[71] nvmet: ctrl 1 start keep-alive timer for 120 secs
nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
nvme nvme0: NVME-FC{0}: controller connect complete
nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:1} Association created
[453] nvmet: ctrl 2 start keep-alive timer for 5 secs
nvmet: creating nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
[71] nvmet: adding queue 1 to ctrl 2.
[45] nvmet: adding queue 2 to ctrl 2.
[453] nvmet: adding queue 3 to ctrl 2.
[105] nvmet: adding queue 4 to ctrl 2.
nvme nvme1: NVME-FC{1}: controller connect complete
nvme nvme1: NVME-FC{1}: new ctrl: NQN "blktests-subsystem-1"
[453] nvmet: ctrl 2 reschedule traffic based keep-alive timer
[105] nvmet: ctrl 2 update keep-alive timer for 5 secs
[105] nvmet: ctrl 2 update keep-alive timer for 5 secs
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[45] nvmet: ctrl 1 stop keep-alive
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
nvme nvme1: rescanning namespaces.
nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
nvme nvme1: NVME-FC{1}: resetting controller
[105] nvmet: ctrl 2 stop keep-alive
nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:1} Association deleted
(NULL device *): {0:0} Association created
(NULL device *): {0:1} Association freed
nvmet: connect request for invalid subsystem nqn.2014-08.org.nvmexpress.discovery!
nvme nvme0: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): Disconnect LS failed: No Association
nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (16770)
(NULL device *): {0:1} Association created
nvme nvme0: NVME-FC{0}: reconnect failure
nvmet: connect request for invalid subsystem blktests-subsystem-1!
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: Connect Invalid Data Parameter, subsysnqn "blktests-subsystem-1"
nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: NVME-FC{1}: reset: Reconnect attempt failed (16770)
nvme nvme1: NVME-FC{1}: reconnect failure
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
(NULL device *): {0:1} Association deleted
(NULL device *): {0:1} Association freed
(NULL device *): Disconnect LS failed: No Association
nvmet_fc: nvmet_fc_exit_module: targetport list not empty

I think these patches are not very controversial and should probably go in
even if we don't have a fix for the above scenario yet.

v2:
- added additional fixes

v1:
- initial version
- https://lore.kernel.org/linux-nvme/[email protected]/

Daniel Wagner (4):
nvmet-fcloop: Remove remote port from list when unlinking
nvmet-fcloop: Do not wait on completion when unregister fails
nvmet-fc: Do not wait in vain when unloading module
nvmet-fc: Release reference on target port

drivers/nvme/host/fc.c | 20 +++++++++++++-------
drivers/nvme/target/fc.c | 1 +
drivers/nvme/target/fcloop.c | 5 ++++-
3 files changed, 18 insertions(+), 8 deletions(-)

--
2.40.0


2023-04-11 12:11:49

by Daniel Wagner

Subject: [PATCH v2 1/4] nvmet-fcloop: Remove remote port from list when unlinking

The remote port is never removed from the fcloop_nports list. During module
unload it is possible to end up in a busy loop in fcloop_exit(), because the
remote port is still found on the list and we never make progress.

The kernel log will be spammed with

nvme_fcloop: fcloop_exit: Failed deleting remote port
nvme_fcloop: fcloop_exit: Failed deleting target port
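
For reference, fcloop_exit() walks fcloop_nports roughly like this (a
paraphrased sketch from my reading of the code, not the verbatim source).
Without the list_del() added by this patch, list_first_entry_or_null()
keeps returning the same nport on every iteration:

static void __exit fcloop_exit(void)
{
	struct fcloop_nport *nport;
	unsigned long flags;

	spin_lock_irqsave(&fcloop_lock, flags);
	for (;;) {
		/* finds the same entry again if it was never unlinked */
		nport = list_first_entry_or_null(&fcloop_nports,
						 typeof(*nport), nport_list);
		if (!nport)
			break;

		/*
		 * unlink tport/rport, drop the lock, try to unregister
		 * them (which warns on failure), then retake the lock
		 */
		spin_unlock_irqrestore(&fcloop_lock, flags);
		spin_lock_irqsave(&fcloop_lock, flags);
	}
	spin_unlock_irqrestore(&fcloop_lock, flags);
}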

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/target/fcloop.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 5c16372f3b53..1e53c8fe4b95 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -1352,6 +1352,8 @@ __unlink_remote_port(struct fcloop_nport *nport)
nport->tport->remoteport = NULL;
nport->rport = NULL;

+ list_del(&nport->nport_list);
+
return rport;
}

--
2.40.0

2023-04-11 12:12:37

by Daniel Wagner

Subject: [PATCH v2 3/4] nvmet-fc: Do not wait in vain when unloading module

When there is no controller to be deleted, the module unload path will
still wait on the nvme_fc_unload_proceed completion. Because this
completion will never be triggered, the caller hangs forever.
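
As far as I can see, nvme_fc_unload_proceed is only completed from the
lport free path, so nothing wakes the waiter up unless a controller delete
eventually drops the last lport reference. Roughly (paraphrased from fc.c,
not the verbatim source):

static void
nvme_fc_free_lport(struct kref *ref)
{
	...
	spin_lock_irqsave(&nvme_fc_lock, flags);
	list_del(&lport->port_list);
	/* only fires once the last lport goes away */
	if (nvme_fc_waiting_to_unload && list_empty(&nvme_fc_lport_list))
		complete(&nvme_fc_unload_proceed);
	spin_unlock_irqrestore(&nvme_fc_lock, flags);
	...
}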

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 456ee42a6133..df85cf93742b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3933,10 +3933,11 @@ static int __init nvme_fc_init_module(void)
return ret;
}

-static void
+static bool
nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
{
struct nvme_fc_ctrl *ctrl;
+ bool cleanup = false;

spin_lock(&rport->lock);
list_for_each_entry(ctrl, &rport->ctrl_list, ctrl_list) {
@@ -3944,21 +3945,28 @@ nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
"NVME-FC{%d}: transport unloading: deleting ctrl\n",
ctrl->cnum);
nvme_delete_ctrl(&ctrl->ctrl);
+ cleanup = true;
}
spin_unlock(&rport->lock);
+
+ return cleanup;
}

-static void
+static bool
nvme_fc_cleanup_for_unload(void)
{
struct nvme_fc_lport *lport;
struct nvme_fc_rport *rport;
+ bool cleanup = false;

list_for_each_entry(lport, &nvme_fc_lport_list, port_list) {
list_for_each_entry(rport, &lport->endp_list, endp_list) {
- nvme_fc_delete_controllers(rport);
+ if (nvme_fc_delete_controllers(rport))
+ cleanup = true;
}
}
+
+ return cleanup;
}

static void __exit nvme_fc_exit_module(void)
@@ -3968,10 +3976,8 @@ static void __exit nvme_fc_exit_module(void)

spin_lock_irqsave(&nvme_fc_lock, flags);
nvme_fc_waiting_to_unload = true;
- if (!list_empty(&nvme_fc_lport_list)) {
- need_cleanup = true;
- nvme_fc_cleanup_for_unload();
- }
+ if (!list_empty(&nvme_fc_lport_list))
+ need_cleanup = nvme_fc_cleanup_for_unload();
spin_unlock_irqrestore(&nvme_fc_lock, flags);
if (need_cleanup) {
pr_info("%s: waiting for ctlr deletes\n", __func__);
--
2.40.0

2023-04-11 12:15:14

by Daniel Wagner

Subject: [PATCH v2 2/4] nvmet-fcloop: Do not wait on completion when unregister fails

nvme_fc_unregister_localport() returns an error code when the localport
pointer is NULL or the port has already been unregistered. The localport is
either in the ONLINE state (all resources allocated) or has already been
put into the DELETED state.

In the error case we will never receive a wakeup call, and thus any caller
waiting on the completion will hang, e.g. during module unload.
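
The wakeup only comes from the ->localport_delete callback, which the
transport invokes as part of a successful unregister. In fcloop that
callback is roughly (paraphrased):

static void
fcloop_localport_delete(struct nvme_fc_local_port *localport)
{
	struct fcloop_lport_priv *lport_priv = localport->private;
	struct fcloop_lport *lport = lport_priv->lport;

	/* release any threads waiting for the unreg to complete */
	complete(&lport->unreg_done);
}

If nvme_fc_unregister_localport() fails, the callback never runs and a
wait on lport->unreg_done can never succeed.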

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/target/fcloop.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 1e53c8fe4b95..6c3905498a94 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -1162,7 +1162,8 @@ __wait_localport_unreg(struct fcloop_lport *lport)

ret = nvme_fc_unregister_localport(lport->localport);

- wait_for_completion(&lport->unreg_done);
+ if (!ret)
+ wait_for_completion(&lport->unreg_done);

kfree(lport);

--
2.40.0

2023-04-11 15:04:56

by Daniel Wagner

Subject: Re: [PATCH v2 0/4] nvmet-fcloop: unblock module removal

On Tue, Apr 11, 2023 at 02:07:14PM +0200, Daniel Wagner wrote:
> loop: module loaded
> run blktests nvme/003 at 2023-04-11 13:55:57
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> [71] nvmet: ctrl 1 start keep-alive timer for 120 secs
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> nvme nvme0: NVME-FC{0}: controller connect complete
> nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:1} Association created
> [453] nvmet: ctrl 2 start keep-alive timer for 5 secs
> nvmet: creating nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> [71] nvmet: adding queue 1 to ctrl 2.
> [45] nvmet: adding queue 2 to ctrl 2.
> [453] nvmet: adding queue 3 to ctrl 2.
> [105] nvmet: adding queue 4 to ctrl 2.
> nvme nvme1: NVME-FC{1}: controller connect complete
> nvme nvme1: NVME-FC{1}: new ctrl: NQN "blktests-subsystem-1"
> [453] nvmet: ctrl 2 reschedule traffic based keep-alive timer
> [105] nvmet: ctrl 2 update keep-alive timer for 5 secs
> [105] nvmet: ctrl 2 update keep-alive timer for 5 secs
> nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [45] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association

It turns out this sequence is triggered by the udev auto-connect rules. As I
read the _setup_nvmet() code, the connect should happen via blktests and not
udev. Obviously, this messes with the test case. I'll try to figure out how we
can disable the udev rules during the test run.