2024-02-16 08:48:49

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 0/6] nvme-fc: fix blktests nvme/041

After the target side is working with blktests and blktests is also able
to deal with the FC transport, it's time to address the fallout on the host
side. As a first step, let's fix the failing nvme/041 test.

As we already discussed, the main issue here is that the FC transport defers
the connect attempt to a workqueue. The other fabric transports don't do this,
and all blktests expect the 'nvme connect' call to be synchronous.

Initially, I just added a completion and waited for the connect to succeed or
fail. But this triggered a lot of UAFs. After banging my head against this
problem for a while, I decided to replace the entire ref-counting strategy.

With this new approach all tests except nvme/048 are passing, and no UAFs or
other troubles were observed. I also tested with real hardware (lpfc, qla2xxx),
though I don't have a way to trigger all sorts of transport errors, so it
remains to be seen whether my patches break anything under such conditions.

I think there is still one problem left in the module exit code path. The
cleanup function iterates over the ctrl list stored in the rport object. The
delete code path is not atomic and removes the controller from the list
somewhere in the delete path. Thus this races with the module unload, IMO. We
could just maintain a list of controllers protected by a lock, as we have in
tcp/rdma.

Daniel Wagner (6):
nvme-fabrics: introduce connect_sync option
nvme-fc: rename free_ctrl callback to match name pattern
nvme-fc: do not retry when auth fails or connection is refused
nvme-fabrics: introduce ref counting for nvmf_ctrl_options
nvme-fc: redesign locking and refcounting
nvme-fc: wait for connect attempt to finish

drivers/nvme/host/fabrics.c | 28 +++++-
drivers/nvme/host/fabrics.h | 9 +-
drivers/nvme/host/fc.c | 180 ++++++++++++++++--------------------
drivers/nvme/host/rdma.c | 18 +++-
drivers/nvme/host/tcp.c | 21 +++--
drivers/nvme/target/loop.c | 19 ++--
6 files changed, 150 insertions(+), 125 deletions(-)

--
2.43.0



2024-02-16 08:49:04

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 1/6] nvme-fabrics: introduce connect_sync option

The TCP and RDMA transports do a synchronous connect, meaning the
syscall returns with the final result, that is, it either failed or
succeeded.

This isn't the case for FC. This transport just sets up and triggers
the connect and returns without waiting on the result. Introduce a flag
to allow user space to control the behavior: wait or don't wait.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fabrics.c | 6 +++++-
drivers/nvme/host/fabrics.h | 3 +++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 3499acbf6a82..7d33f0f5824f 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -678,6 +678,7 @@ static const match_table_t opt_tokens = {
#ifdef CONFIG_NVME_TCP_TLS
{ NVMF_OPT_TLS, "tls" },
#endif
+ { NVMF_OPT_CONNECT_SYNC, "connect_sync" },
{ NVMF_OPT_ERR, NULL }
};

@@ -1024,6 +1025,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
}
opts->tls = true;
break;
+ case NVMF_OPT_CONNECT_SYNC:
+ opts->connect_sync = true;
+ break;
default:
pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
p);
@@ -1245,7 +1249,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\
NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\
- NVMF_OPT_DHCHAP_CTRL_SECRET)
+ NVMF_OPT_DHCHAP_CTRL_SECRET | NVMF_OPT_CONNECT_SYNC)

static struct nvme_ctrl *
nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 06cc54851b1b..01d3ef545f14 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -73,6 +73,7 @@ enum {
NVMF_OPT_TLS = 1 << 25,
NVMF_OPT_KEYRING = 1 << 26,
NVMF_OPT_TLS_KEY = 1 << 27,
+ NVMF_OPT_CONNECT_SYNC = 1 << 28,
};

/**
@@ -115,6 +116,7 @@ enum {
* @nr_poll_queues: number of queues for polling I/O
* @tos: type of service
* @fast_io_fail_tmo: Fast I/O fail timeout in seconds
+ * @connect_sync: wait for connect attempt(s) to succeed or fail
*/
struct nvmf_ctrl_options {
unsigned mask;
@@ -144,6 +146,7 @@ struct nvmf_ctrl_options {
unsigned int nr_poll_queues;
int tos;
int fast_io_fail_tmo;
+ bool connect_sync;
};

/*
--
2.43.0


2024-02-16 08:49:20

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 2/6] nvme-fc: rename free_ctrl callback to match name pattern

Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name
pattern for the callback.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 68a5d971657b..a5b29e9ad342 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2428,7 +2428,7 @@ nvme_fc_ctrl_get(struct nvme_fc_ctrl *ctrl)
* controller. Called after last nvme_put_ctrl() call
*/
static void
-nvme_fc_nvme_ctrl_freed(struct nvme_ctrl *nctrl)
+nvme_fc_free_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);

@@ -3384,7 +3384,7 @@ static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.reg_read32 = nvmf_reg_read32,
.reg_read64 = nvmf_reg_read64,
.reg_write32 = nvmf_reg_write32,
- .free_ctrl = nvme_fc_nvme_ctrl_freed,
+ .free_ctrl = nvme_fc_free_ctrl,
.submit_async_event = nvme_fc_submit_async_event,
.delete_ctrl = nvme_fc_delete_ctrl,
.get_address = nvmf_get_address,
--
2.43.0


2024-02-16 08:49:50

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 3/6] nvme-fc: do not retry when auth fails or connection is refused

There is no point in retrying the connect if the authentication fails.

Connection refused is also issued from the authentication path, thus
do not retry in that case either.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a5b29e9ad342..b81046c9f171 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3312,6 +3312,8 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, status);
if (status > 0 && (status & NVME_SC_DNR))
recon = false;
+ if (status == NVME_SC_AUTH_REQUIRED || status == -ECONNREFUSED)
+ recon = false;
} else if (time_after_eq(jiffies, rport->dev_loss_end))
recon = false;

--
2.43.0


2024-02-16 08:50:03

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 5/6] nvme-fc: redesign locking and refcounting

The lifetime of the controller is managed by the upper layers.

Thus just take a ref on the controller when creating it and give the
ref back on the cleanup path. This is how the other transports are
managed as well. Until now, a ref has been taken per LS request, which
is not really necessary as the core guarantees that there is no in-
flight request when shutting down (provided the nvme APIs are used
correctly).

In fact we don't really need the ref count for nvme_fc_ctrl at this
point. Though, the FC transport offloads the connect attempt to a
workqueue, and in the next patch we introduce a sync option for which
the ref counter is necessary. So let's keep it around.

Also take a ref on the lport and rport when creating the controller and
give it back when we destroy the controller. This means these refs are
tied to the lifetime of the controller and not the other way around.

We also have to reorder the cleanup code in nvme_fc_delete_ctrl and
nvme_fc_free_ctrl so that we do not expose resources for too long and
run into use-after-free situations, which are currently possible.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 136 +++++++++++++----------------------------
1 file changed, 41 insertions(+), 95 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index ddbc5b21af5b..7f9edab57550 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -229,6 +229,9 @@ static struct device *fc_udev_device;

static void nvme_fc_complete_rq(struct request *rq);

+static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
+static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
+
/* *********************** FC-NVME Port Management ************************ */

static void __nvme_fc_delete_hw_queue(struct nvme_fc_ctrl *,
@@ -800,7 +803,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: Couldn't schedule reset.\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
}
break;

@@ -868,7 +871,7 @@ nvme_fc_unregister_remoteport(struct nvme_fc_remote_port *portptr)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: controller connectivity lost.\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
} else
nvme_fc_ctrl_connectivity_loss(ctrl);
}
@@ -1022,9 +1025,6 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,

/* *********************** FC-NVME LS Handling **************************** */

-static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
-static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
-
static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);

static void
@@ -1050,8 +1050,6 @@ __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
fc_dma_unmap_single(rport->dev, lsreq->rqstdma,
(lsreq->rqstlen + lsreq->rsplen),
DMA_BIDIRECTIONAL);
-
- nvme_fc_rport_put(rport);
}

static int
@@ -1066,9 +1064,6 @@ __nvme_fc_send_ls_req(struct nvme_fc_rport *rport,
if (rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
return -ECONNREFUSED;

- if (!nvme_fc_rport_get(rport))
- return -ESHUTDOWN;
-
lsreq->done = done;
lsop->rport = rport;
lsop->req_queued = false;
@@ -1078,10 +1073,8 @@ __nvme_fc_send_ls_req(struct nvme_fc_rport *rport,
lsreq->rqstdma = fc_dma_map_single(rport->dev, lsreq->rqstaddr,
lsreq->rqstlen + lsreq->rsplen,
DMA_BIDIRECTIONAL);
- if (fc_dma_mapping_error(rport->dev, lsreq->rqstdma)) {
- ret = -EFAULT;
- goto out_putrport;
- }
+ if (fc_dma_mapping_error(rport->dev, lsreq->rqstdma))
+ return -EFAULT;
lsreq->rspdma = lsreq->rqstdma + lsreq->rqstlen;

spin_lock_irqsave(&rport->lock, flags);
@@ -1108,9 +1101,6 @@ __nvme_fc_send_ls_req(struct nvme_fc_rport *rport,
fc_dma_unmap_single(rport->dev, lsreq->rqstdma,
(lsreq->rqstlen + lsreq->rsplen),
DMA_BIDIRECTIONAL);
-out_putrport:
- nvme_fc_rport_put(rport);
-
return ret;
}

@@ -1471,8 +1461,6 @@ nvme_fc_xmt_ls_rsp_done(struct nvmefc_ls_rsp *lsrsp)
kfree(lsop->rspbuf);
kfree(lsop->rqstbuf);
kfree(lsop);
-
- nvme_fc_rport_put(rport);
}

static void
@@ -1511,8 +1499,6 @@ nvme_fc_match_disconn_ls(struct nvme_fc_rport *rport,
spin_lock_irqsave(&rport->lock, flags);

list_for_each_entry(ctrl, &rport->ctrl_list, ctrl_list) {
- if (!nvme_fc_ctrl_get(ctrl))
- continue;
spin_lock(&ctrl->lock);
if (association_id == ctrl->association_id) {
oldls = ctrl->rcv_disconn;
@@ -1520,10 +1506,6 @@ nvme_fc_match_disconn_ls(struct nvme_fc_rport *rport,
ret = ctrl;
}
spin_unlock(&ctrl->lock);
- if (ret)
- /* leave the ctrl get reference */
- break;
- nvme_fc_ctrl_put(ctrl);
}

spin_unlock_irqrestore(&rport->lock, flags);
@@ -1602,9 +1584,6 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
/* fail the association */
nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");

- /* release the reference taken by nvme_fc_match_disconn_ls() */
- nvme_fc_ctrl_put(ctrl);
-
return false;
}

@@ -1734,16 +1713,13 @@ nvme_fc_rcv_ls_req(struct nvme_fc_remote_port *portptr,
unsigned long flags;
int ret;

- nvme_fc_rport_get(rport);
-
/* validate there's a routine to transmit a response */
if (!lport->ops->xmt_ls_rsp) {
dev_info(lport->dev,
"RCV %s LS failed: no LLDD xmt_ls_rsp\n",
(w0->ls_cmd <= NVME_FC_LAST_LS_CMD_VALUE) ?
nvmefc_ls_names[w0->ls_cmd] : "");
- ret = -EINVAL;
- goto out_put;
+ return -EINVAL;
}

if (lsreqbuf_len > sizeof(union nvmefc_ls_requests)) {
@@ -1751,15 +1727,13 @@ nvme_fc_rcv_ls_req(struct nvme_fc_remote_port *portptr,
"RCV %s LS failed: payload too large\n",
(w0->ls_cmd <= NVME_FC_LAST_LS_CMD_VALUE) ?
nvmefc_ls_names[w0->ls_cmd] : "");
- ret = -E2BIG;
- goto out_put;
+ return -E2BIG;
}

lsop = kzalloc(sizeof(*lsop), GFP_KERNEL);
if (!lsop) {
nvme_fc_rcv_ls_req_err_msg(lport, w0);
- ret = -ENOMEM;
- goto out_put;
+ return -ENOMEM;
}

lsop->rqstbuf = kzalloc(sizeof(*lsop->rqstbuf), GFP_KERNEL);
@@ -1808,8 +1782,6 @@ nvme_fc_rcv_ls_req(struct nvme_fc_remote_port *portptr,
kfree(lsop->rspbuf);
kfree(lsop->rqstbuf);
kfree(lsop);
-out_put:
- nvme_fc_rport_put(rport);
return ret;
}
EXPORT_SYMBOL_GPL(nvme_fc_rcv_ls_req);
@@ -2071,7 +2043,6 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
atomic_set(&op->state, FCPOP_STATE_IDLE);
op->flags = FCOP_FLAGS_AEN; /* clear other flags */
- nvme_fc_ctrl_put(ctrl);
goto check_error;
}

@@ -2383,37 +2354,18 @@ nvme_fc_init_io_queues(struct nvme_fc_ctrl *ctrl)
}

static void
-nvme_fc_ctrl_free(struct kref *ref)
+nvme_fc_ctrl_delete(struct kref *ref)
{
struct nvme_fc_ctrl *ctrl =
container_of(ref, struct nvme_fc_ctrl, ref);
- unsigned long flags;
-
- if (ctrl->ctrl.tagset)
- nvme_remove_io_tag_set(&ctrl->ctrl);
-
- /* remove from rport list */
- spin_lock_irqsave(&ctrl->rport->lock, flags);
- list_del(&ctrl->ctrl_list);
- spin_unlock_irqrestore(&ctrl->rport->lock, flags);
-
- nvme_unquiesce_admin_queue(&ctrl->ctrl);
- nvme_remove_admin_tag_set(&ctrl->ctrl);
-
- kfree(ctrl->queues);
-
- put_device(ctrl->dev);
- nvme_fc_rport_put(ctrl->rport);

- ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
- nvmf_ctrl_options_put(ctrl->ctrl.opts);
- kfree(ctrl);
+ nvme_delete_ctrl(&ctrl->ctrl);
}

static void
nvme_fc_ctrl_put(struct nvme_fc_ctrl *ctrl)
{
- kref_put(&ctrl->ref, nvme_fc_ctrl_free);
+ kref_put(&ctrl->ref, nvme_fc_ctrl_delete);
}

static int
@@ -2431,9 +2383,18 @@ nvme_fc_free_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);

- WARN_ON(nctrl != &ctrl->ctrl);

- nvme_fc_ctrl_put(ctrl);
+ if (ctrl->ctrl.tagset)
+ nvme_remove_io_tag_set(&ctrl->ctrl);
+
+ nvme_unquiesce_admin_queue(&ctrl->ctrl);
+ nvme_remove_admin_tag_set(&ctrl->ctrl);
+
+ kfree(ctrl->queues);
+
+ ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
+ nvmf_ctrl_options_put(ctrl->ctrl.opts);
+ kfree(ctrl);
}

/*
@@ -2682,9 +2643,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
return BLK_STS_RESOURCE;

- if (!nvme_fc_ctrl_get(ctrl))
- return BLK_STS_IOERR;
-
/* format the FC-NVME CMD IU and fcp_req */
cmdiu->connection_id = cpu_to_be64(queue->connection_id);
cmdiu->data_len = cpu_to_be32(data_len);
@@ -2729,7 +2687,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
ret = nvme_fc_map_data(ctrl, op->rq, op);
if (ret < 0) {
nvme_cleanup_cmd(op->rq);
- nvme_fc_ctrl_put(ctrl);
if (ret == -ENOMEM || ret == -EAGAIN)
return BLK_STS_RESOURCE;
return BLK_STS_IOERR;
@@ -2770,8 +2727,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
nvme_cleanup_cmd(op->rq);
}

- nvme_fc_ctrl_put(ctrl);
-
if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE &&
ret != -EBUSY)
return BLK_STS_IOERR;
@@ -2855,7 +2810,6 @@ nvme_fc_complete_rq(struct request *rq)

nvme_fc_unmap_data(ctrl, rq, op);
nvme_complete_rq(rq);
- nvme_fc_ctrl_put(ctrl);
}

static void nvme_fc_map_queues(struct blk_mq_tag_set *set)
@@ -3284,14 +3238,24 @@ static void
nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
+ unsigned long flags;

cancel_work_sync(&ctrl->ioerr_work);
cancel_delayed_work_sync(&ctrl->connect_work);
+
+ /* remove from rport list */
+ spin_lock_irqsave(&ctrl->rport->lock, flags);
+ list_del(&ctrl->ctrl_list);
+ spin_unlock_irqrestore(&ctrl->rport->lock, flags);
+
/*
* kill the association on the link side. this will block
* waiting for io to terminate
*/
nvme_fc_delete_association(ctrl);
+
+ nvme_fc_rport_put(ctrl->rport);
+ nvme_fc_lport_put(ctrl->lport);
}

static void
@@ -3344,7 +3308,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, min_t(int, portptr->dev_loss_tmo,
(ctrl->ctrl.opts->max_reconnects *
ctrl->ctrl.opts->reconnect_delay)));
- WARN_ON(nvme_delete_ctrl(&ctrl->ctrl));
+ nvme_fc_ctrl_put(ctrl);
}
}

@@ -3502,12 +3466,13 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
INIT_LIST_HEAD(&ctrl->ctrl_list);
ctrl->lport = lport;
ctrl->rport = rport;
+ nvme_fc_lport_get(lport);
+ nvme_fc_rport_get(rport);
ctrl->dev = lport->dev;
ctrl->cnum = idx;
ctrl->ioq_live = false;
init_waitqueue_head(&ctrl->ioabort_wait);

- get_device(ctrl->dev);
kref_init(&ctrl->ref);

INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
@@ -3582,32 +3547,13 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
return &ctrl->ctrl;

fail_ctrl:
- nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
- cancel_work_sync(&ctrl->ioerr_work);
- cancel_work_sync(&ctrl->ctrl.reset_work);
- cancel_delayed_work_sync(&ctrl->connect_work);
-
- /* initiate nvme ctrl ref counting teardown */
- nvme_uninit_ctrl(&ctrl->ctrl);
-
- /* Remove core ctrl ref. */
- nvme_put_ctrl(&ctrl->ctrl);
-
- /* as we're past the point where we transition to the ref
- * counting teardown path, if we return a bad pointer here,
- * the calling routine, thinking it's prior to the
- * transition, will do an rport put. Since the teardown
- * path also does a rport put, we do an extra get here to
- * so proper order/teardown happens.
- */
- nvme_fc_rport_get(rport);
+ nvme_fc_ctrl_put(ctrl);

return ERR_PTR(-EIO);

out_free_queues:
kfree(ctrl->queues);
out_free_ida:
- put_device(ctrl->dev);
ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
out_free_opts:
nvmf_ctrl_options_put(opts);
@@ -3724,8 +3670,8 @@ nvme_fc_create_ctrl(struct device *dev, struct nvmf_ctrl_options *opts)
spin_unlock_irqrestore(&nvme_fc_lock, flags);

ctrl = nvme_fc_init_ctrl(dev, opts, lport, rport);
- if (IS_ERR(ctrl))
- nvme_fc_rport_put(rport);
+ nvme_fc_rport_put(rport);
+
return ctrl;
}
}
@@ -3950,7 +3896,7 @@ nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport unloading: deleting ctrl\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
}
spin_unlock(&rport->lock);
}
--
2.43.0


2024-02-16 08:50:21

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 4/6] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

The FC transport offloads the connect attempt to a workqueue. When
the attempt fails, the transport starts to clean up resources. It is
possible for user space to trigger a crash because nvmf_ctrl_options is
exposed to sysfs.

This crash wasn't observed with blktests nvme/041 until now because the
retry loop usually kept trying for a long time (e.g. 600s with the
defaults) and the test would trigger the cleanup itself. Though with
the recent change to not retry when invalid credentials are used, the
crash can be easily triggered.

The simplest way to control the lifetime of nvmf_ctrl_options is by
using ref counting.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fabrics.c | 22 +++++++++++++++++++---
drivers/nvme/host/fabrics.h | 6 +++++-
drivers/nvme/host/fc.c | 14 +++++++++-----
drivers/nvme/host/rdma.c | 18 +++++++++++++-----
drivers/nvme/host/tcp.c | 21 ++++++++++++++-------
drivers/nvme/target/loop.c | 19 +++++++++++++------
6 files changed, 73 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 7d33f0f5824f..3d775718cff7 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -1226,8 +1226,11 @@ static int nvmf_check_allowed_opts(struct nvmf_ctrl_options *opts,
return 0;
}

-void nvmf_free_options(struct nvmf_ctrl_options *opts)
+static void nvmf_free_options(struct kref *ref)
{
+ struct nvmf_ctrl_options *opts =
+ container_of(ref, struct nvmf_ctrl_options, ref);
+
nvmf_host_put(opts->host);
key_put(opts->keyring);
key_put(opts->tls_key);
@@ -1241,7 +1244,18 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
kfree(opts->dhchap_ctrl_secret);
kfree(opts);
}
-EXPORT_SYMBOL_GPL(nvmf_free_options);
+
+int nvmf_ctrl_options_get(struct nvmf_ctrl_options *opts)
+{
+ return kref_get_unless_zero(&opts->ref);
+}
+EXPORT_SYMBOL_GPL(nvmf_ctrl_options_get);
+
+void nvmf_ctrl_options_put(struct nvmf_ctrl_options *opts)
+{
+ kref_put(&opts->ref, nvmf_free_options);
+}
+EXPORT_SYMBOL_GPL(nvmf_ctrl_options_put);

#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
@@ -1263,6 +1277,8 @@ nvmf_create_ctrl(struct device *dev, const char *buf)
if (!opts)
return ERR_PTR(-ENOMEM);

+ kref_init(&opts->ref);
+
ret = nvmf_parse_options(opts, buf);
if (ret)
goto out_free_opts;
@@ -1318,7 +1334,7 @@ nvmf_create_ctrl(struct device *dev, const char *buf)
out_unlock:
up_read(&nvmf_transports_rwsem);
out_free_opts:
- nvmf_free_options(opts);
+ nvmf_ctrl_options_put(opts);
return ERR_PTR(ret);
}

diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 01d3ef545f14..67882e4cbe46 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -79,6 +79,7 @@ enum {
/**
* struct nvmf_ctrl_options - Used to hold the options specified
* with the parsing opts enum.
+ * @ref: for reference count of the data structure
* @mask: Used by the fabrics library to parse through sysfs options
* on adding a NVMe controller.
* @max_reconnects: maximum number of allowed reconnect attempts before removing
@@ -119,6 +120,7 @@ enum {
* @connect_sync: wait for connect attempt(s) to succeed or fail
*/
struct nvmf_ctrl_options {
+ struct kref ref;
unsigned mask;
int max_reconnects;
char *transport;
@@ -149,6 +151,9 @@ struct nvmf_ctrl_options {
bool connect_sync;
};

+int nvmf_ctrl_options_get(struct nvmf_ctrl_options *opts);
+void nvmf_ctrl_options_put(struct nvmf_ctrl_options *opts);
+
/*
* struct nvmf_transport_ops - used to register a specific
* fabric implementation of NVMe fabrics.
@@ -231,7 +236,6 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl);
int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid);
int nvmf_register_transport(struct nvmf_transport_ops *ops);
void nvmf_unregister_transport(struct nvmf_transport_ops *ops);
-void nvmf_free_options(struct nvmf_ctrl_options *opts);
int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size);
bool nvmf_should_reconnect(struct nvme_ctrl *ctrl);
bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b81046c9f171..ddbc5b21af5b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2406,8 +2406,7 @@ nvme_fc_ctrl_free(struct kref *ref)
nvme_fc_rport_put(ctrl->rport);

ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
- if (ctrl->ctrl.opts)
- nvmf_free_options(ctrl->ctrl.opts);
+ nvmf_ctrl_options_put(ctrl->ctrl.opts);
kfree(ctrl);
}

@@ -3474,10 +3473,15 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
goto out_fail;
}

+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
idx = ida_alloc(&nvme_fc_ctrl_cnt, GFP_KERNEL);
if (idx < 0) {
ret = -ENOSPC;
- goto out_free_ctrl;
+ goto out_free_opts;
}

/*
@@ -3583,8 +3587,6 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
cancel_work_sync(&ctrl->ctrl.reset_work);
cancel_delayed_work_sync(&ctrl->connect_work);

- ctrl->ctrl.opts = NULL;
-
/* initiate nvme ctrl ref counting teardown */
nvme_uninit_ctrl(&ctrl->ctrl);

@@ -3607,6 +3609,8 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
out_free_ida:
put_device(ctrl->dev);
ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
out_fail:
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 20fdd40b1879..d3747795ad80 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -976,8 +976,8 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
list_del(&ctrl->list);
mutex_unlock(&nvme_rdma_ctrl_mutex);

- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl->queues);
kfree(ctrl);
}
@@ -2236,6 +2236,12 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
return ERR_PTR(-ENOMEM);
+
+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);

@@ -2244,7 +2250,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
kstrdup(__stringify(NVME_RDMA_IP_PORT), GFP_KERNEL);
if (!opts->trsvcid) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}
opts->mask |= NVMF_OPT_TRSVCID;
}
@@ -2263,13 +2269,13 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed src address passed: %s\n",
opts->host_traddr);
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

if (!opts->duplicate_connect && nvme_rdma_existing_controller(opts)) {
ret = -EALREADY;
- goto out_free_ctrl;
+ goto out_free_opts;
}

INIT_DELAYED_WORK(&ctrl->reconnect_work,
@@ -2286,7 +2292,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
GFP_KERNEL);
if (!ctrl->queues)
- goto out_free_ctrl;
+ goto out_free_opts;

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
@@ -2317,6 +2323,8 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
return ERR_PTR(ret);
out_kfree_queues:
kfree(ctrl->queues);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
return ERR_PTR(ret);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a6d596e05602..3b20c5ed033f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2349,8 +2349,8 @@ static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
list_del(&ctrl->list);
mutex_unlock(&nvme_tcp_ctrl_mutex);

- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl->queues);
kfree(ctrl);
}
@@ -2678,6 +2678,11 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (!ctrl)
return ERR_PTR(-ENOMEM);

+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
INIT_LIST_HEAD(&ctrl->list);
ctrl->ctrl.opts = opts;
ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues +
@@ -2695,7 +2700,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
if (!opts->trsvcid) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}
opts->mask |= NVMF_OPT_TRSVCID;
}
@@ -2705,7 +2710,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed address passed: %s:%s\n",
opts->traddr, opts->trsvcid);
- goto out_free_ctrl;
+ goto out_free_opts;
}

if (opts->mask & NVMF_OPT_HOST_TRADDR) {
@@ -2714,7 +2719,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed src address passed: %s\n",
opts->host_traddr);
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

@@ -2723,20 +2728,20 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
pr_err("invalid interface passed: %s\n",
opts->host_iface);
ret = -ENODEV;
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
ret = -EALREADY;
- goto out_free_ctrl;
+ goto out_free_opts;
}

ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
GFP_KERNEL);
if (!ctrl->queues) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
@@ -2770,6 +2775,8 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
return ERR_PTR(ret);
out_kfree_queues:
kfree(ctrl->queues);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
return ERR_PTR(ret);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e589915ddef8..de2ff7ed0657 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -283,8 +283,8 @@ static void nvme_loop_free_ctrl(struct nvme_ctrl *nctrl)
if (nctrl->tagset)
nvme_remove_io_tag_set(nctrl);
kfree(ctrl->queues);
- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl);
}

@@ -543,6 +543,12 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
return ERR_PTR(-ENOMEM);
+
+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);

@@ -550,10 +556,8 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_loop_ctrl_ops,
0 /* no quirks, we're perfect! */);
- if (ret) {
- kfree(ctrl);
- goto out;
- }
+ if (ret)
+ goto out_free_opts;

if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
WARN_ON_ONCE(1);
@@ -612,7 +616,10 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
out_uninit_ctrl:
nvme_uninit_ctrl(&ctrl->ctrl);
nvme_put_ctrl(&ctrl->ctrl);
-out:
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
+out_free_ctrl:
+ kfree(ctrl);
if (ret > 0)
ret = -EIO;
return ERR_PTR(ret);
--
2.43.0


2024-02-16 08:50:47

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v0 6/6] nvme-fc: wait for connect attempt to finish

The FC transport offloads the connect attempt to a workqueue, so
userspace is not able to wait on the result.

Allow userspace to wait on the connect result by honoring the
'connect_sync' connect option.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 7f9edab57550..5f1d0165de40 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -169,6 +169,7 @@ struct nvme_fc_ctrl {

struct work_struct ioerr_work;
struct delayed_work connect_work;
+ struct completion connect_completion;

struct kref ref;
unsigned long flags;
@@ -803,6 +804,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: Couldn't schedule reset.\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
break;
@@ -871,6 +873,7 @@ nvme_fc_unregister_remoteport(struct nvme_fc_remote_port *portptr)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: controller connectivity lost.\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
} else
nvme_fc_ctrl_connectivity_loss(ctrl);
@@ -3308,6 +3311,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, min_t(int, portptr->dev_loss_tmo,
(ctrl->ctrl.opts->max_reconnects *
ctrl->ctrl.opts->reconnect_delay)));
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
}
@@ -3367,10 +3371,12 @@ nvme_fc_connect_ctrl_work(struct work_struct *work)
ret = nvme_fc_create_association(ctrl);
if (ret)
nvme_fc_reconnect_or_delete(ctrl, ret);
- else
+ else {
dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: controller connect complete\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
+ }
}


@@ -3477,6 +3483,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,

INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
+ init_completion(&ctrl->connect_completion);
INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);

@@ -3524,6 +3531,9 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
spin_unlock_irqrestore(&rport->lock, flags);

+ if (opts->connect_sync)
+ nvme_fc_ctrl_get(ctrl);
+
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) ||
!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
dev_err(ctrl->ctrl.device,
@@ -3540,6 +3550,19 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,

flush_delayed_work(&ctrl->connect_work);

+ if (opts->connect_sync) {
+ enum nvme_ctrl_state state;
+
+ wait_for_completion(&ctrl->connect_completion);
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
+
+ if (state != NVME_CTRL_LIVE) {
+ /* Cleanup is handled by the connect state machine */
+ return ERR_PTR(-EIO);
+ }
+ }
+
dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: new ctrl: NQN \"%s\", hostnqn: %s\n",
ctrl->cnum, nvmf_ctrl_subsysnqn(&ctrl->ctrl), opts->host->nqn);
@@ -3896,6 +3919,7 @@ nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport unloading: deleting ctrl\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
spin_unlock(&rport->lock);
--
2.43.0


2024-02-16 09:49:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v0 1/6] nvme-fabrics: introduce connect_sync option

On Fri, Feb 16, 2024 at 09:45:21AM +0100, Daniel Wagner wrote:
> The TCP and RDMA transports do a synchronous connect, meaning the
> syscall returns with the final result, that is, it either failed or
> succeeded.
>
> This isn't the case for FC. This transport just sets up and triggers
> the connect and returns without waiting on the result.

That's really weird and unexpected. James, can you explain the reason
behind this?

> Introduce a flag
> to allow user space to control the behavior, wait or don't wait.

I'd expect this to be the default, but I'll wait to hear more about
the rationale. If we keep the async default the option looks sensible.


2024-02-16 09:51:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v0 4/6] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

On Fri, Feb 16, 2024 at 09:45:24AM +0100, Daniel Wagner wrote:
> The FC transport is offloading the connect attempt to a workqueue. When
> the attempt fails the transport is starting to cleanup resources. It is
> possible for user space to trigger a crash because nvmf_ctrl_options are
> exposed to sysfs.

Eww. I think the async offload is the real problem here..


2024-02-16 10:02:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v0 5/6] nvme-fc: redesign locking and refcounting

This looks reasonable to me. Without much further reading I don't
really feel qualified to comment on nvme-fc locking and refcounting,
though.

2024-02-16 11:10:32

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v0 5/6] nvme-fc: redesign locking and refcounting

On 2/16/24 09:45, Daniel Wagner wrote:
> The lifetime of the controller is managed by the upper layers.
>
> Thus just take a ref on the controller when creating it and give the
> ref back on the cleanup path. This is how the other transports are
> managed as well. Until now, the ref count has been taken per LS request,
> which is not really necessary as the core guarantees that there are no
> in-flight requests when shutting down (if the nvme APIs are used
> correctly).
>
> In fact we don't really need the ref count for nvme_fc_ctrl at this
> point. However, the FC transport offloads the connect attempt to a
> workqueue, and in the next patch we introduce a sync option for which
> the ref counter is necessary. So let's keep it around.
>
> Also take a ref on the lport and rport when creating the controller and
> give it back when we destroy the controller. This means these refs are
> tied to the lifetime of the controller and not the other way around.
>
> We also have to reorder the cleanup code in nvme_fc_delete_ctrl and
> nvme_fc_free_ctrl so that we do not expose resources for too long and
> run into use-after-free situations, which are currently possible.
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> drivers/nvme/host/fc.c | 136 +++++++++++++----------------------------
> 1 file changed, 41 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index ddbc5b21af5b..7f9edab57550 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -229,6 +229,9 @@ static struct device *fc_udev_device;
>
> static void nvme_fc_complete_rq(struct request *rq);
>
> +static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> +static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> +
> /* *********************** FC-NVME Port Management ************************ */
>
> static void __nvme_fc_delete_hw_queue(struct nvme_fc_ctrl *,
> @@ -800,7 +803,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> dev_warn(ctrl->ctrl.device,
> "NVME-FC{%d}: Couldn't schedule reset.\n",
> ctrl->cnum);
> - nvme_delete_ctrl(&ctrl->ctrl);
> + nvme_fc_ctrl_put(ctrl);
> }
> break;
>
> @@ -868,7 +871,7 @@ nvme_fc_unregister_remoteport(struct nvme_fc_remote_port *portptr)
> dev_warn(ctrl->ctrl.device,
> "NVME-FC{%d}: controller connectivity lost.\n",
> ctrl->cnum);
> - nvme_delete_ctrl(&ctrl->ctrl);
> + nvme_fc_ctrl_put(ctrl);
> } else
> nvme_fc_ctrl_connectivity_loss(ctrl);
> }
> @@ -1022,9 +1025,6 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>
> /* *********************** FC-NVME LS Handling **************************** */
>
> -static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> -static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> -
> static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
>
> static void
> @@ -1050,8 +1050,6 @@ __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> fc_dma_unmap_single(rport->dev, lsreq->rqstdma,
> (lsreq->rqstlen + lsreq->rsplen),
> DMA_BIDIRECTIONAL);
> -
> - nvme_fc_rport_put(rport);
> }
>
Hmm. I'm a bit unsure about this; essentially you change the rport
refcounting (and not just the controller refcounting).
And the problem here is that rport refcounting is actually tied to
the driver-internal rports, which have a different lifetime
(dev_loss_tmo and all that).

Would it be possible to break this in two, with one patch changing the
controller/options refcounting and the other one changing the rport
refcounting?

Cheers,

Hannes


2024-02-16 12:40:33

by Daniel Wagner

[permalink] [raw]
Subject: Re: Re: [PATCH v0 5/6] nvme-fc: redesign locking and refcounting

On Fri, Feb 16, 2024 at 12:09:20PM +0100, Hannes Reinecke wrote:
> Hmm. I'm a bit unsure about this; essentially you change the rport
> refcounting (and not just the controller refcounting).
> And the problem here is that rport refcounting is actually tied to
> the driver-internal rports, which have a different lifetime
> (dev_loss_tmo and all that).
>
> Would it be possible to break this in two, with one patch changing the
> controller/options refcounting and the other one changing the rport
> refcounting?

Yeah, I see. I completely forgot about the dev_loss_tmo thing. I'll try
to split this patch.

2024-02-16 16:44:16

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v0 1/6] nvme-fabrics: introduce connect_sync option

On Fri, Feb 16, 2024 at 10:49:09AM +0100, Christoph Hellwig wrote:
> On Fri, Feb 16, 2024 at 09:45:21AM +0100, Daniel Wagner wrote:
> > The TCP and RDMA transports do a synchronous connect, meaning the
> > syscall returns with the final result, that is, it either failed or
> > succeeded.
> >
> > This isn't the case for FC. This transport just sets up and triggers
> > the connect and returns without waiting on the result.
>
> That's really weird and unexpected. James, can you explain the reason
> behind this?

James answered this point on my attempt to make this synchronous:

https://lore.kernel.org/linux-nvme/[email protected]/

> > Introduce a flag
> > to allow user space to control the behavior, wait or don't wait.
>
> I'd expect this to be the default, but I'll wait to hear more about
> the rationale. If we keep the async default the option looks sensible.

Ideally, we could agree on behavior which is the same for all
transports.
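
For reference, fabrics options reach the kernel as a comma-separated string written to /dev/nvme-fabrics, with boolean options given bare. With the proposed option, the string might look like this (hypothetical, assuming the option lands under the name 'connect_sync'; addresses elided):

```text
transport=fc,traddr=nn-0x...:pn-0x...,host_traddr=nn-0x...:pn-0x...,nqn=...,connect_sync
```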

2024-02-17 16:28:24

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v0 1/6] nvme-fabrics: introduce connect_sync option

On 2/16/24 10:49, Christoph Hellwig wrote:
> On Fri, Feb 16, 2024 at 09:45:21AM +0100, Daniel Wagner wrote:
>> The TCP and RDMA transports do a synchronous connect, meaning the
>> syscall returns with the final result, that is, it either failed or
>> succeeded.
>>
>> This isn't the case for FC. This transport just sets up and triggers
>> the connect and returns without waiting on the result.
>
> That's really weird and unexpected. James, can you explain the reason
> behind this?
>
The reason is that the initial connect attempt might fail with a
temporary failure and will need to be retried. Rather than implementing
two methods for handling this (one for the initial connect, and another
one for reconnect, where one _has_ to use a workqueue) as e.g. TCP and
RDMA have done, FC uses a single code path for handling both.

Temporary failure on initial connect is far more likely on FC than on
other transports due to the way FC-NVMe is modelled; essentially
one has to log into the remote port for each protocol. So if you run
in a dual fabric (with both FCP and NVMe) you'll need to log into the
same remote port twice. Depending on the implementation the target might
only be capable of handling one port login at a time, so the
other one will be failed with a temporary error.
That's why it's a common issue with FC. It _might_ happen with TCP, too,
but apparently not regularly, otherwise we would have seen quite a few
failures here; TCP can't really handle temporary failures for the
initial connect.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-02-20 06:51:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v0 1/6] nvme-fabrics: introduce connect_sync option

On Fri, Feb 16, 2024 at 05:44:02PM +0100, Daniel Wagner wrote:
> James answered this point on my attempt to make this synchronous:
>
> https://lore.kernel.org/linux-nvme/[email protected]/

That needs to go into the commit log. And I call complete BS on that
to be honest.

> Ideally, we could agree on behavior which is the same for all
> transports.

sync and async opt in it is. I'm still pissed FC did this differently
without any proper discussion of the tradeoffs.