2024-02-21 13:24:28

by Daniel Wagner

Subject: [PATCH v2 0/5] nvme-fc: fix blktests nvme/041

As suggested by Keith, I've merged the new flag for nvme-fabrics into the last
patch. I've also added some information on why nvme-fc handles the initial
connect attempt differently, as requested by Christoph.

The new flag is called connect_async and the default is false/0. I've tested
with a patched nvme-cli/libnvme version and all looks good. Once this series is
accepted I'll update nvme-cli/libnvme accordingly. Note that even with an older
nvme-cli, the blktests (especially nvme/041) will pass with a newer kernel.

(nvme/048 still fails, but this is a different problem; it's on my TODO list.)
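For reference, with this series applied the new option is part of the fabrics
option string written to /dev/nvme-fabrics (placeholder addresses below, since
the matching nvme-cli flag does not exist yet):

```
transport=fc,traddr=<traddr>,host_traddr=<host_traddr>,nqn=<subsysnqn>,connect_async=1
```

The default (connect_async=0) gives the new synchronous initial connect
behaviour.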

changes:
v2:
- renamed flag to connect_async, default false
- add info to commit message why nvme-fc is different
- merged connect_async with 'nvme-fc: wait for
initial connect attempt to finish'

v1:
- renamed 'nvme-fc: redesign locking and refcounting'
to 'nvme-fc: reorder ctrl ref counting and cleanup code path'
- testing with scsi/nvme dev_loss_tmo on real hw
- removed rport ref counting part
- collected RB tags
- https://lore.kernel.org/linux-nvme/[email protected]/

v0:
- initial version
- https://lore.kernel.org/linux-nvme/[email protected]/

Daniel Wagner (5):
nvme-fc: rename free_ctrl callback to match name pattern
nvme-fc: do not retry when auth fails or connection is refused
nvme-fabrics: introduce ref counting for nvmf_ctrl_options
nvme-fc: reorder ctrl ref counting and cleanup code path
nvme-fc: wait for initial connect attempt to finish

drivers/nvme/host/fabrics.c | 39 +++++++++-
drivers/nvme/host/fabrics.h | 9 ++-
drivers/nvme/host/fc.c | 145 +++++++++++++++++-------------------
drivers/nvme/host/rdma.c | 18 +++--
drivers/nvme/host/tcp.c | 21 ++++--
drivers/nvme/target/loop.c | 19 +++--
6 files changed, 152 insertions(+), 99 deletions(-)

--
2.43.1



2024-02-21 13:24:42

by Daniel Wagner

Subject: [PATCH v2 1/5] nvme-fc: rename free_ctrl callback to match name pattern

Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name
pattern for the callback.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 68a5d971657b..a5b29e9ad342 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2428,7 +2428,7 @@ nvme_fc_ctrl_get(struct nvme_fc_ctrl *ctrl)
* controller. Called after last nvme_put_ctrl() call
*/
static void
-nvme_fc_nvme_ctrl_freed(struct nvme_ctrl *nctrl)
+nvme_fc_free_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);

@@ -3384,7 +3384,7 @@ static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.reg_read32 = nvmf_reg_read32,
.reg_read64 = nvmf_reg_read64,
.reg_write32 = nvmf_reg_write32,
- .free_ctrl = nvme_fc_nvme_ctrl_freed,
+ .free_ctrl = nvme_fc_free_ctrl,
.submit_async_event = nvme_fc_submit_async_event,
.delete_ctrl = nvme_fc_delete_ctrl,
.get_address = nvmf_get_address,
--
2.43.1


2024-02-21 13:24:44

by Daniel Wagner

Subject: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

There is no point in retrying the connect if authentication fails.

A connection refused error is also issued from the authentication path,
so do not retry in that case either.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a5b29e9ad342..b81046c9f171 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3312,6 +3312,8 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, status);
if (status > 0 && (status & NVME_SC_DNR))
recon = false;
+ if (status == NVME_SC_AUTH_REQUIRED || status == -ECONNREFUSED)
+ recon = false;
} else if (time_after_eq(jiffies, rport->dev_loss_end))
recon = false;

--
2.43.1


2024-02-21 13:25:04

by Daniel Wagner

Subject: [PATCH v2 4/5] nvme-fc: reorder ctrl ref counting and cleanup code path

The lifetime of the controller is managed by the upper layers.

Thus just take a reference on the controller when creating it and give the
reference back on the cleanup path. This is how the other transports are
managed as well.

In fact we don't really need the ref count for nvme_fc_ctrl at this
point. However, the FC transport offloads the connect attempt to a
workqueue, and in the next patch we introduce a sync option for which the
ref counter is necessary. So let's keep it around.

We also have to reorder the cleanup code in nvme_fc_delete_ctrl and
nvme_fc_free_ctrl so that we do not expose resources for too long and run
into use-after-free situations, which are currently possible.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fc.c | 101 +++++++++++++----------------------------
1 file changed, 32 insertions(+), 69 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index ddbc5b21af5b..7627d10a5812 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -229,6 +229,9 @@ static struct device *fc_udev_device;

static void nvme_fc_complete_rq(struct request *rq);

+static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
+static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
+
/* *********************** FC-NVME Port Management ************************ */

static void __nvme_fc_delete_hw_queue(struct nvme_fc_ctrl *,
@@ -800,7 +803,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: Couldn't schedule reset.\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
}
break;

@@ -868,7 +871,7 @@ nvme_fc_unregister_remoteport(struct nvme_fc_remote_port *portptr)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: controller connectivity lost.\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
} else
nvme_fc_ctrl_connectivity_loss(ctrl);
}
@@ -1022,9 +1025,6 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,

/* *********************** FC-NVME LS Handling **************************** */

-static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
-static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
-
static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);

static void
@@ -1511,8 +1511,6 @@ nvme_fc_match_disconn_ls(struct nvme_fc_rport *rport,
spin_lock_irqsave(&rport->lock, flags);

list_for_each_entry(ctrl, &rport->ctrl_list, ctrl_list) {
- if (!nvme_fc_ctrl_get(ctrl))
- continue;
spin_lock(&ctrl->lock);
if (association_id == ctrl->association_id) {
oldls = ctrl->rcv_disconn;
@@ -1520,10 +1518,6 @@ nvme_fc_match_disconn_ls(struct nvme_fc_rport *rport,
ret = ctrl;
}
spin_unlock(&ctrl->lock);
- if (ret)
- /* leave the ctrl get reference */
- break;
- nvme_fc_ctrl_put(ctrl);
}

spin_unlock_irqrestore(&rport->lock, flags);
@@ -1602,9 +1596,6 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
/* fail the association */
nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");

- /* release the reference taken by nvme_fc_match_disconn_ls() */
- nvme_fc_ctrl_put(ctrl);
-
return false;
}

@@ -2071,7 +2062,6 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
atomic_set(&op->state, FCPOP_STATE_IDLE);
op->flags = FCOP_FLAGS_AEN; /* clear other flags */
- nvme_fc_ctrl_put(ctrl);
goto check_error;
}

@@ -2383,37 +2373,18 @@ nvme_fc_init_io_queues(struct nvme_fc_ctrl *ctrl)
}

static void
-nvme_fc_ctrl_free(struct kref *ref)
+nvme_fc_ctrl_delete(struct kref *ref)
{
struct nvme_fc_ctrl *ctrl =
container_of(ref, struct nvme_fc_ctrl, ref);
- unsigned long flags;
-
- if (ctrl->ctrl.tagset)
- nvme_remove_io_tag_set(&ctrl->ctrl);
-
- /* remove from rport list */
- spin_lock_irqsave(&ctrl->rport->lock, flags);
- list_del(&ctrl->ctrl_list);
- spin_unlock_irqrestore(&ctrl->rport->lock, flags);
-
- nvme_unquiesce_admin_queue(&ctrl->ctrl);
- nvme_remove_admin_tag_set(&ctrl->ctrl);

- kfree(ctrl->queues);
-
- put_device(ctrl->dev);
- nvme_fc_rport_put(ctrl->rport);
-
- ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
- nvmf_ctrl_options_put(ctrl->ctrl.opts);
- kfree(ctrl);
+ nvme_delete_ctrl(&ctrl->ctrl);
}

static void
nvme_fc_ctrl_put(struct nvme_fc_ctrl *ctrl)
{
- kref_put(&ctrl->ref, nvme_fc_ctrl_free);
+ kref_put(&ctrl->ref, nvme_fc_ctrl_delete);
}

static int
@@ -2431,9 +2402,20 @@ nvme_fc_free_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);

- WARN_ON(nctrl != &ctrl->ctrl);

- nvme_fc_ctrl_put(ctrl);
+ if (ctrl->ctrl.tagset)
+ nvme_remove_io_tag_set(&ctrl->ctrl);
+
+ nvme_unquiesce_admin_queue(&ctrl->ctrl);
+ nvme_remove_admin_tag_set(&ctrl->ctrl);
+
+ kfree(ctrl->queues);
+ put_device(ctrl->dev);
+ nvme_fc_rport_put(ctrl->rport);
+
+ ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
+ nvmf_ctrl_options_put(ctrl->ctrl.opts);
+ kfree(ctrl);
}

/*
@@ -2682,9 +2664,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
return BLK_STS_RESOURCE;

- if (!nvme_fc_ctrl_get(ctrl))
- return BLK_STS_IOERR;
-
/* format the FC-NVME CMD IU and fcp_req */
cmdiu->connection_id = cpu_to_be64(queue->connection_id);
cmdiu->data_len = cpu_to_be32(data_len);
@@ -2729,7 +2708,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
ret = nvme_fc_map_data(ctrl, op->rq, op);
if (ret < 0) {
nvme_cleanup_cmd(op->rq);
- nvme_fc_ctrl_put(ctrl);
if (ret == -ENOMEM || ret == -EAGAIN)
return BLK_STS_RESOURCE;
return BLK_STS_IOERR;
@@ -2770,8 +2748,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
nvme_cleanup_cmd(op->rq);
}

- nvme_fc_ctrl_put(ctrl);
-
if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE &&
ret != -EBUSY)
return BLK_STS_IOERR;
@@ -2855,7 +2831,6 @@ nvme_fc_complete_rq(struct request *rq)

nvme_fc_unmap_data(ctrl, rq, op);
nvme_complete_rq(rq);
- nvme_fc_ctrl_put(ctrl);
}

static void nvme_fc_map_queues(struct blk_mq_tag_set *set)
@@ -3284,9 +3259,16 @@ static void
nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
+ unsigned long flags;

cancel_work_sync(&ctrl->ioerr_work);
cancel_delayed_work_sync(&ctrl->connect_work);
+
+ /* remove from rport list */
+ spin_lock_irqsave(&ctrl->rport->lock, flags);
+ list_del(&ctrl->ctrl_list);
+ spin_unlock_irqrestore(&ctrl->rport->lock, flags);
+
/*
* kill the association on the link side. this will block
* waiting for io to terminate
@@ -3344,7 +3326,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, min_t(int, portptr->dev_loss_tmo,
(ctrl->ctrl.opts->max_reconnects *
ctrl->ctrl.opts->reconnect_delay)));
- WARN_ON(nvme_delete_ctrl(&ctrl->ctrl));
+ nvme_fc_ctrl_put(ctrl);
}
}

@@ -3582,25 +3564,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
return &ctrl->ctrl;

fail_ctrl:
- nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
- cancel_work_sync(&ctrl->ioerr_work);
- cancel_work_sync(&ctrl->ctrl.reset_work);
- cancel_delayed_work_sync(&ctrl->connect_work);
-
- /* initiate nvme ctrl ref counting teardown */
- nvme_uninit_ctrl(&ctrl->ctrl);
-
- /* Remove core ctrl ref. */
- nvme_put_ctrl(&ctrl->ctrl);
-
- /* as we're past the point where we transition to the ref
- * counting teardown path, if we return a bad pointer here,
- * the calling routine, thinking it's prior to the
- * transition, will do an rport put. Since the teardown
- * path also does a rport put, we do an extra get here to
- * so proper order/teardown happens.
- */
- nvme_fc_rport_get(rport);
+ nvme_fc_ctrl_put(ctrl);

return ERR_PTR(-EIO);

@@ -3614,6 +3578,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
out_free_ctrl:
kfree(ctrl);
out_fail:
+ nvme_fc_rport_put(rport);
/* exit via here doesn't follow ctlr ref points */
return ERR_PTR(ret);
}
@@ -3724,8 +3689,6 @@ nvme_fc_create_ctrl(struct device *dev, struct nvmf_ctrl_options *opts)
spin_unlock_irqrestore(&nvme_fc_lock, flags);

ctrl = nvme_fc_init_ctrl(dev, opts, lport, rport);
- if (IS_ERR(ctrl))
- nvme_fc_rport_put(rport);
return ctrl;
}
}
@@ -3950,7 +3913,7 @@ nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport unloading: deleting ctrl\n",
ctrl->cnum);
- nvme_delete_ctrl(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
}
spin_unlock(&rport->lock);
}
--
2.43.1


2024-02-21 13:25:39

by Daniel Wagner

Subject: [PATCH v2 3/5] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

The FC transport offloads the connect attempt to a workqueue. When the
attempt fails, the transport starts to clean up resources. It is
possible for user space to trigger a crash because nvmf_ctrl_options is
exposed via sysfs.

This crash wasn't observed with blktests nvme/041 until now because the
retry loop was usually trying several times (e.g. with the defaults, for
600s) and the test would trigger the cleanup itself. But with the recent
change to not retry when invalid credentials are used, the crash can be
triggered easily.

The simplest way to control the lifetime of nvmf_ctrl_options is by
using ref counting.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fabrics.c | 22 +++++++++++++++++++---
drivers/nvme/host/fabrics.h | 6 +++++-
drivers/nvme/host/fc.c | 14 +++++++++-----
drivers/nvme/host/rdma.c | 18 +++++++++++++-----
drivers/nvme/host/tcp.c | 21 ++++++++++++++-------
drivers/nvme/target/loop.c | 19 +++++++++++++------
6 files changed, 73 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 3499acbf6a82..888285fe2289 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -1222,8 +1222,11 @@ static int nvmf_check_allowed_opts(struct nvmf_ctrl_options *opts,
return 0;
}

-void nvmf_free_options(struct nvmf_ctrl_options *opts)
+static void nvmf_free_options(struct kref *ref)
{
+ struct nvmf_ctrl_options *opts =
+ container_of(ref, struct nvmf_ctrl_options, ref);
+
nvmf_host_put(opts->host);
key_put(opts->keyring);
key_put(opts->tls_key);
@@ -1237,7 +1240,18 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
kfree(opts->dhchap_ctrl_secret);
kfree(opts);
}
-EXPORT_SYMBOL_GPL(nvmf_free_options);
+
+int nvmf_ctrl_options_get(struct nvmf_ctrl_options *opts)
+{
+ return kref_get_unless_zero(&opts->ref);
+}
+EXPORT_SYMBOL_GPL(nvmf_ctrl_options_get);
+
+void nvmf_ctrl_options_put(struct nvmf_ctrl_options *opts)
+{
+ kref_put(&opts->ref, nvmf_free_options);
+}
+EXPORT_SYMBOL_GPL(nvmf_ctrl_options_put);

#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
@@ -1259,6 +1273,8 @@ nvmf_create_ctrl(struct device *dev, const char *buf)
if (!opts)
return ERR_PTR(-ENOMEM);

+ kref_init(&opts->ref);
+
ret = nvmf_parse_options(opts, buf);
if (ret)
goto out_free_opts;
@@ -1314,7 +1330,7 @@ nvmf_create_ctrl(struct device *dev, const char *buf)
out_unlock:
up_read(&nvmf_transports_rwsem);
out_free_opts:
- nvmf_free_options(opts);
+ nvmf_ctrl_options_put(opts);
return ERR_PTR(ret);
}

diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 06cc54851b1b..8436533aed16 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -78,6 +78,7 @@ enum {
/**
* struct nvmf_ctrl_options - Used to hold the options specified
* with the parsing opts enum.
+ * @ref: for reference count of the data structure
* @mask: Used by the fabrics library to parse through sysfs options
* on adding a NVMe controller.
* @max_reconnects: maximum number of allowed reconnect attempts before removing
@@ -117,6 +118,7 @@ enum {
* @fast_io_fail_tmo: Fast I/O fail timeout in seconds
*/
struct nvmf_ctrl_options {
+ struct kref ref;
unsigned mask;
int max_reconnects;
char *transport;
@@ -146,6 +148,9 @@ struct nvmf_ctrl_options {
int fast_io_fail_tmo;
};

+int nvmf_ctrl_options_get(struct nvmf_ctrl_options *opts);
+void nvmf_ctrl_options_put(struct nvmf_ctrl_options *opts);
+
/*
* struct nvmf_transport_ops - used to register a specific
* fabric implementation of NVMe fabrics.
@@ -228,7 +233,6 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl);
int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid);
int nvmf_register_transport(struct nvmf_transport_ops *ops);
void nvmf_unregister_transport(struct nvmf_transport_ops *ops);
-void nvmf_free_options(struct nvmf_ctrl_options *opts);
int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size);
bool nvmf_should_reconnect(struct nvme_ctrl *ctrl);
bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b81046c9f171..ddbc5b21af5b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2406,8 +2406,7 @@ nvme_fc_ctrl_free(struct kref *ref)
nvme_fc_rport_put(ctrl->rport);

ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
- if (ctrl->ctrl.opts)
- nvmf_free_options(ctrl->ctrl.opts);
+ nvmf_ctrl_options_put(ctrl->ctrl.opts);
kfree(ctrl);
}

@@ -3474,10 +3473,15 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
goto out_fail;
}

+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
idx = ida_alloc(&nvme_fc_ctrl_cnt, GFP_KERNEL);
if (idx < 0) {
ret = -ENOSPC;
- goto out_free_ctrl;
+ goto out_free_opts;
}

/*
@@ -3583,8 +3587,6 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
cancel_work_sync(&ctrl->ctrl.reset_work);
cancel_delayed_work_sync(&ctrl->connect_work);

- ctrl->ctrl.opts = NULL;
-
/* initiate nvme ctrl ref counting teardown */
nvme_uninit_ctrl(&ctrl->ctrl);

@@ -3607,6 +3609,8 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
out_free_ida:
put_device(ctrl->dev);
ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
out_fail:
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 20fdd40b1879..d3747795ad80 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -976,8 +976,8 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
list_del(&ctrl->list);
mutex_unlock(&nvme_rdma_ctrl_mutex);

- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl->queues);
kfree(ctrl);
}
@@ -2236,6 +2236,12 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
return ERR_PTR(-ENOMEM);
+
+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);

@@ -2244,7 +2250,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
kstrdup(__stringify(NVME_RDMA_IP_PORT), GFP_KERNEL);
if (!opts->trsvcid) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}
opts->mask |= NVMF_OPT_TRSVCID;
}
@@ -2263,13 +2269,13 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed src address passed: %s\n",
opts->host_traddr);
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

if (!opts->duplicate_connect && nvme_rdma_existing_controller(opts)) {
ret = -EALREADY;
- goto out_free_ctrl;
+ goto out_free_opts;
}

INIT_DELAYED_WORK(&ctrl->reconnect_work,
@@ -2286,7 +2292,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
GFP_KERNEL);
if (!ctrl->queues)
- goto out_free_ctrl;
+ goto out_free_opts;

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
@@ -2317,6 +2323,8 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
return ERR_PTR(ret);
out_kfree_queues:
kfree(ctrl->queues);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
return ERR_PTR(ret);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a6d596e05602..3b20c5ed033f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2349,8 +2349,8 @@ static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
list_del(&ctrl->list);
mutex_unlock(&nvme_tcp_ctrl_mutex);

- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl->queues);
kfree(ctrl);
}
@@ -2678,6 +2678,11 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (!ctrl)
return ERR_PTR(-ENOMEM);

+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
INIT_LIST_HEAD(&ctrl->list);
ctrl->ctrl.opts = opts;
ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues +
@@ -2695,7 +2700,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
if (!opts->trsvcid) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}
opts->mask |= NVMF_OPT_TRSVCID;
}
@@ -2705,7 +2710,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed address passed: %s:%s\n",
opts->traddr, opts->trsvcid);
- goto out_free_ctrl;
+ goto out_free_opts;
}

if (opts->mask & NVMF_OPT_HOST_TRADDR) {
@@ -2714,7 +2719,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (ret) {
pr_err("malformed src address passed: %s\n",
opts->host_traddr);
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

@@ -2723,20 +2728,20 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
pr_err("invalid interface passed: %s\n",
opts->host_iface);
ret = -ENODEV;
- goto out_free_ctrl;
+ goto out_free_opts;
}
}

if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
ret = -EALREADY;
- goto out_free_ctrl;
+ goto out_free_opts;
}

ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
GFP_KERNEL);
if (!ctrl->queues) {
ret = -ENOMEM;
- goto out_free_ctrl;
+ goto out_free_opts;
}

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
@@ -2770,6 +2775,8 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
return ERR_PTR(ret);
out_kfree_queues:
kfree(ctrl->queues);
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
out_free_ctrl:
kfree(ctrl);
return ERR_PTR(ret);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e589915ddef8..de2ff7ed0657 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -283,8 +283,8 @@ static void nvme_loop_free_ctrl(struct nvme_ctrl *nctrl)
if (nctrl->tagset)
nvme_remove_io_tag_set(nctrl);
kfree(ctrl->queues);
- nvmf_free_options(nctrl->opts);
free_ctrl:
+ nvmf_ctrl_options_put(nctrl->opts);
kfree(ctrl);
}

@@ -543,6 +543,12 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
return ERR_PTR(-ENOMEM);
+
+ if (!nvmf_ctrl_options_get(opts)) {
+ ret = -ENOLCK;
+ goto out_free_ctrl;
+ }
+
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);

@@ -550,10 +556,8 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,

ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_loop_ctrl_ops,
0 /* no quirks, we're perfect! */);
- if (ret) {
- kfree(ctrl);
- goto out;
- }
+ if (ret)
+ goto out_free_opts;

if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
WARN_ON_ONCE(1);
@@ -612,7 +616,10 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
out_uninit_ctrl:
nvme_uninit_ctrl(&ctrl->ctrl);
nvme_put_ctrl(&ctrl->ctrl);
-out:
+out_free_opts:
+ nvmf_ctrl_options_put(opts);
+out_free_ctrl:
+ kfree(ctrl);
if (ret > 0)
ret = -EIO;
return ERR_PTR(ret);
--
2.43.1


2024-02-21 13:25:48

by Daniel Wagner

Subject: [PATCH v2 5/5] nvme-fc: wait for initial connect attempt to finish

The TCP and RDMA transports do a synchronous connect, that is, the
syscall returns with the final result. The operation either fails or
succeeds. The FC transport offloads the connect attempt to a workqueue,
and thus it is an asynchronous operation.

This async connect feature was introduced to mitigate problems with
transient connect errors and to avoid having to coordinate retries with
userspace (nvme-cli).

Unfortunately, this makes the transports behave differently on the
initial attempt. Streamline nvme-fc to wait for the initial connect
attempt to succeed or fail.

In order to still support an async connect attempt, introduce a new
flag for userspace. The default is a synchronous initial connect
attempt.

Link: https://lore.kernel.org/linux-nvme/[email protected]/
Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/fabrics.c | 17 ++++++++++++++++-
drivers/nvme/host/fabrics.h | 3 +++
drivers/nvme/host/fc.c | 26 +++++++++++++++++++++++++-
3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 888285fe2289..52c95259debd 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -678,6 +678,7 @@ static const match_table_t opt_tokens = {
#ifdef CONFIG_NVME_TCP_TLS
{ NVMF_OPT_TLS, "tls" },
#endif
+ { NVMF_OPT_CONNECT_ASYNC, "connect_async=%d" },
{ NVMF_OPT_ERR, NULL }
};

@@ -706,6 +707,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->tls = false;
opts->tls_key = NULL;
opts->keyring = NULL;
+ opts->connect_async = false;

options = o = kstrdup(buf, GFP_KERNEL);
if (!options)
@@ -1024,6 +1026,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
}
opts->tls = true;
break;
+ case NVMF_OPT_CONNECT_ASYNC:
+ if (match_int(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (token < 0 || token > 1) {
+ pr_err("Invalid connect_async %d value\n",
+ token);
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->connect_async = token;
+ break;
default:
pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
p);
@@ -1259,7 +1274,7 @@ EXPORT_SYMBOL_GPL(nvmf_ctrl_options_put);
NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\
NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\
- NVMF_OPT_DHCHAP_CTRL_SECRET)
+ NVMF_OPT_DHCHAP_CTRL_SECRET | NVMF_OPT_CONNECT_ASYNC)

static struct nvme_ctrl *
nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 8436533aed16..75bc61401e1b 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -73,6 +73,7 @@ enum {
NVMF_OPT_TLS = 1 << 25,
NVMF_OPT_KEYRING = 1 << 26,
NVMF_OPT_TLS_KEY = 1 << 27,
+ NVMF_OPT_CONNECT_ASYNC = 1 << 28,
};

/**
@@ -116,6 +117,7 @@ enum {
* @nr_poll_queues: number of queues for polling I/O
* @tos: type of service
* @fast_io_fail_tmo: Fast I/O fail timeout in seconds
* @connect_async: Don't wait for the initial connect attempt to succeed or fail
*/
struct nvmf_ctrl_options {
struct kref ref;
@@ -146,6 +148,7 @@ struct nvmf_ctrl_options {
unsigned int nr_poll_queues;
int tos;
int fast_io_fail_tmo;
+ bool connect_async;
};

int nvmf_ctrl_options_get(struct nvmf_ctrl_options *opts);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 7627d10a5812..ebc4ff6b3343 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -169,6 +169,7 @@ struct nvme_fc_ctrl {

struct work_struct ioerr_work;
struct delayed_work connect_work;
+ struct completion connect_completion;

struct kref ref;
unsigned long flags;
@@ -803,6 +804,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: Couldn't schedule reset.\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
break;
@@ -871,6 +873,7 @@ nvme_fc_unregister_remoteport(struct nvme_fc_remote_port *portptr)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: controller connectivity lost.\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
} else
nvme_fc_ctrl_connectivity_loss(ctrl);
@@ -3326,6 +3329,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
ctrl->cnum, min_t(int, portptr->dev_loss_tmo,
(ctrl->ctrl.opts->max_reconnects *
ctrl->ctrl.opts->reconnect_delay)));
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
}
@@ -3385,10 +3389,12 @@ nvme_fc_connect_ctrl_work(struct work_struct *work)
ret = nvme_fc_create_association(ctrl);
if (ret)
nvme_fc_reconnect_or_delete(ctrl, ret);
- else
+ else {
dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: controller connect complete\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
+ }
}


@@ -3494,6 +3500,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,

INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
+ init_completion(&ctrl->connect_completion);
INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);

@@ -3541,6 +3548,9 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
spin_unlock_irqrestore(&rport->lock, flags);

+ if (!opts->connect_async)
+ nvme_fc_ctrl_get(ctrl);
+
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) ||
!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
dev_err(ctrl->ctrl.device,
@@ -3557,6 +3567,19 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,

flush_delayed_work(&ctrl->connect_work);

+ if (!opts->connect_async) {
+ enum nvme_ctrl_state state;
+
+ wait_for_completion(&ctrl->connect_completion);
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
+
+ if (state != NVME_CTRL_LIVE) {
+ /* Cleanup is handled by the connect state machine */
+ return ERR_PTR(-EIO);
+ }
+ }
+
dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: new ctrl: NQN \"%s\", hostnqn: %s\n",
ctrl->cnum, nvmf_ctrl_subsysnqn(&ctrl->ctrl), opts->host->nqn);
@@ -3913,6 +3936,7 @@ nvme_fc_delete_controllers(struct nvme_fc_rport *rport)
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport unloading: deleting ctrl\n",
ctrl->cnum);
+ complete(&ctrl->connect_completion);
nvme_fc_ctrl_put(ctrl);
}
spin_unlock(&rport->lock);
--
2.43.1


2024-02-21 15:53:09

by Hannes Reinecke

Subject: Re: [PATCH v2 1/5] nvme-fc: rename free_ctrl callback to match name pattern

On 2/21/24 14:24, Daniel Wagner wrote:
> Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name
> pattern for the callback.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> drivers/nvme/host/fc.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-02-21 15:54:37

by Hannes Reinecke

Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On 2/21/24 14:24, Daniel Wagner wrote:
> There is no point in retrying the connect if authentication fails.
>
> A connection refused error is also issued from the authentication path,
> so do not retry in that case either.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> drivers/nvme/host/fc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index a5b29e9ad342..b81046c9f171 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -3312,6 +3312,8 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
> ctrl->cnum, status);
> if (status > 0 && (status & NVME_SC_DNR))
> recon = false;
> + if (status == NVME_SC_AUTH_REQUIRED || status == -ECONNREFUSED)
> + recon = false;
> } else if (time_after_eq(jiffies, rport->dev_loss_end))
> recon = false;
>
Is this still required after the patchset to retry authentication errors?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-02-21 15:57:58

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

On 2/21/24 14:24, Daniel Wagner wrote:
> The FC transport is offloading the connect attempt to a workqueue. When
> the attempt fails the transport starts to clean up resources. It is
> possible for user space to trigger a crash because nvmf_ctrl_options are
> exposed to sysfs.
>
> This crash wasn't observed with blktests nvme/041 until now because the
> retry loop usually retried several times (e.g. with the default of
> 600s) and the test would trigger the cleanup itself. With the recent
> change not to retry with invalid credentials, the crash can be easily
> triggered.
>
> The simplest way to control the life time of nvmf_ctrl_options is by
> using ref counting.
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> drivers/nvme/host/fabrics.c | 22 +++++++++++++++++++---
> drivers/nvme/host/fabrics.h | 6 +++++-
> drivers/nvme/host/fc.c | 14 +++++++++-----
> drivers/nvme/host/rdma.c | 18 +++++++++++++-----
> drivers/nvme/host/tcp.c | 21 ++++++++++++++-------
> drivers/nvme/target/loop.c | 19 +++++++++++++------
> 6 files changed, 73 insertions(+), 27 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes

--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-02-21 16:40:03

by Daniel Wagner

[permalink] [raw]
Subject: Re: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
> On 2/21/24 14:24, Daniel Wagner wrote:
> > There is no point in retrying to connect if the authentication fails.
> >
> > Connection refused is also issued from the authentication path, thus
> > also do not retry.
> >
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Signed-off-by: Daniel Wagner <[email protected]>
> > ---
> > drivers/nvme/host/fc.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index a5b29e9ad342..b81046c9f171 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -3312,6 +3312,8 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
> > ctrl->cnum, status);
> > if (status > 0 && (status & NVME_SC_DNR))
> > recon = false;
> > + if (status == NVME_SC_AUTH_REQUIRED || status == -ECONNREFUSED)
> > + recon = false;
> > } else if (time_after_eq(jiffies, rport->dev_loss_end))
> > recon = false;
> Is this still required after the patchset to retry authentication errors?

Do you mean

48dae46676d1 ("nvme: enable retries for authentication commands")

?

In this case yes, I've tested on top of this patch. This breaks the loop
where the provided key is invalid or missing. The loop would happily
retry until reaching the maximum number of retries.

2024-02-21 18:04:25

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] nvme-fc: fix blktests nvme/041

On Wed, Feb 21, 2024 at 02:23:59PM +0100, Daniel Wagner wrote:
> As suggested by Keith I've merged the new flag for nvme-fabrics with the last
> patch. I also added some information why nvme-fc is handling the initial connect
> attempt differently as requested by Christoph.
>
> The new flag is called connect_async and the default is false/0. I've tested
> with a patched nvme-cli/libnvme version and all looks good. When this series is
> accepted I'll update nvme-cli/libnvme accordingly. Note, that even with an older
> nvme-cli the blktests (especially nvme/041) will pass with a newer kernel.
>
> (nvme/048 still fails but this is a different problem. On my TODO list)

Series looks good to me. Sounds like it's not fixing any regressions, so
I think this goes to the 6.9 branch. I'll wait till end of week for any
other feedback before pushing anything out.

2024-02-22 06:46:25

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On 2/21/24 17:37, Daniel Wagner wrote:
> On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
>> On 2/21/24 14:24, Daniel Wagner wrote:
>>> There is no point in retrying to connect if the authentication fails.
>>>
>>> Connection refused is also issued from the authentication path, thus
>>> also do not retry.
>>>
>>> Reviewed-by: Christoph Hellwig <[email protected]>
>>> Signed-off-by: Daniel Wagner <[email protected]>
>>> ---
>>> drivers/nvme/host/fc.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>>> index a5b29e9ad342..b81046c9f171 100644
>>> --- a/drivers/nvme/host/fc.c
>>> +++ b/drivers/nvme/host/fc.c
>>> @@ -3312,6 +3312,8 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
>>> ctrl->cnum, status);
>>> if (status > 0 && (status & NVME_SC_DNR))
>>> recon = false;
>>> + if (status == NVME_SC_AUTH_REQUIRED || status == -ECONNREFUSED)
>>> + recon = false;
>>> } else if (time_after_eq(jiffies, rport->dev_loss_end))
>>> recon = false;
>> Is this still required after the patchset to retry authentication errors?
>
> Do you mean
>
> 48dae46676d1 ("nvme: enable retries for authentication commands")
>
> ?
Yes.

>
> In this case yes, I've tested on top of this patch. This breaks the loop
> where the provided key is invalid or missing. The loop would happily
> retry until reaching the maximum number of retries.

But that's to be expected, no? After all, that's _precisely_ what
NVME_SC_DNR is for; if you shouldn't retry, that bit is set.
If it's not set, you should.
So why is NVME_SC_AUTH_REQUIRED an exception?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-02-22 07:45:34

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On Thu, Feb 22, 2024 at 07:46:12AM +0100, Hannes Reinecke wrote:
> On 2/21/24 17:37, Daniel Wagner wrote:
> > On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
> > In this case yes, I've tested on top of this patch. This breaks the loop
> > where the provided key is invalid or missing. The loop would happily
> > retry until reaching the maximum number of retries.
>
> But that's to be expected, no?

Why? If the key is wrong/missing it will likely still be wrong/missing on
the next retry. So what's the point in retrying?

> After all, that's _precisely_ what
> NVME_SC_DNR is for;
> if you shouldn't retry, that bit is set.
> If it's not set, you should.

Okay, in this case there is a bug in the auth code somewhere.

> So why is NVME_SC_AUTH_REQUIRED an exception?

status is either NVME_SC_AUTH_REQUIRED or -ECONNREFUSED for the first
part of the nvme/041 (no key set). In this case DNR bit is not set.

Note, when -ECONNREFUSED is returned,

if (status > 0 && (status & NVME_SC_DNR))

will not catch it either.

Glad we have these tests now.

2024-02-22 17:11:45

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On Thu, Feb 22, 2024 at 08:45:04AM +0100, Daniel Wagner wrote:
> On Thu, Feb 22, 2024 at 07:46:12AM +0100, Hannes Reinecke wrote:
> > On 2/21/24 17:37, Daniel Wagner wrote:
> > > On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
> > > In this case yes, I've tested on top of this patch. This breaks the loop
> > > where the provided key is invalid or missing. The loop would happily
> > > retry until reaching the maximum number of retries.
> >
> > But that's to be expected, no?
>
> > Why? If the key is wrong/missing it will likely still be wrong/missing
> > on the next retry. So what's the point in retrying?
>
> > After all, that's _precisely_ what
> > NVME_SC_DNR is for;
> > if you shouldn't retry, that bit is set.
> > If it's not set, you should.
>
> Okay, in this case there is a bug in the auth code somewhere.

With the change below nvme/041 also passes:

modified drivers/nvme/host/fabrics.c
@@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
if (result & NVME_CONNECT_AUTHREQ_ASCR) {
dev_warn(ctrl->device,
"qid 0: secure concatenation is not supported\n");
- ret = NVME_SC_AUTH_REQUIRED;
+ ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
goto out_free_data;
}
/* Authentication required */
@@ -475,7 +475,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
if (ret) {
dev_warn(ctrl->device,
"qid 0: authentication setup failed\n");
- ret = NVME_SC_AUTH_REQUIRED;
+ ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
goto out_free_data;
}
ret = nvme_auth_wait(ctrl, 0);
@@ -540,8 +540,9 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
/* Secure concatenation is not implemented */
if (result & NVME_CONNECT_AUTHREQ_ASCR) {
dev_warn(ctrl->device,
- "qid 0: secure concatenation is not supported\n");
- ret = NVME_SC_AUTH_REQUIRED;
+ "qid %d: secure concatenation is not supported\n",
+ qid);
+ ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
goto out_free_data;
}
/* Authentication required */
@@ -549,7 +550,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
if (ret) {
dev_warn(ctrl->device,
"qid %d: authentication setup failed\n", qid);
- ret = NVME_SC_AUTH_REQUIRED;
+ ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
} else {
ret = nvme_auth_wait(ctrl, qid);
if (ret)

Is this what you had in mind?

2024-02-23 11:59:28

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On 2/22/24 18:02, Daniel Wagner wrote:
> On Thu, Feb 22, 2024 at 08:45:04AM +0100, Daniel Wagner wrote:
>> On Thu, Feb 22, 2024 at 07:46:12AM +0100, Hannes Reinecke wrote:
>>> On 2/21/24 17:37, Daniel Wagner wrote:
>>>> On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
>>>> In this case yes, I've tested on top of this patch. This breaks the loop
>>>> where the provided key is invalid or missing. The loop would happily
>>>> retry until reaching the maximum number of retries.
>>>
>>> But that's to be expected, no?
>>
>>> Why? If the key is wrong/missing it will likely still be wrong/missing
>>> on the next retry. So what's the point in retrying?
>>
>>> After all, that's _precisely_ what
>>> NVME_SC_DNR is for;
>>> if you shouldn't retry, that bit is set.
>>> If it's not set, you should.
>>
>> Okay, in this case there is a bug in the auth code somewhere.
>
> With the change below nvme/041 also passes:
>
> modified drivers/nvme/host/fabrics.c
> @@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
> if (result & NVME_CONNECT_AUTHREQ_ASCR) {
> dev_warn(ctrl->device,
> "qid 0: secure concatenation is not supported\n");
> - ret = NVME_SC_AUTH_REQUIRED;
> + ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
> goto out_free_data;
> }
> /* Authentication required */
> @@ -475,7 +475,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
> if (ret) {
> dev_warn(ctrl->device,
> "qid 0: authentication setup failed\n");
> - ret = NVME_SC_AUTH_REQUIRED;
> + ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
> goto out_free_data;
> }
> ret = nvme_auth_wait(ctrl, 0);
> @@ -540,8 +540,9 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
> /* Secure concatenation is not implemented */
> if (result & NVME_CONNECT_AUTHREQ_ASCR) {
> dev_warn(ctrl->device,
> - "qid 0: secure concatenation is not supported\n");
> - ret = NVME_SC_AUTH_REQUIRED;
> + "qid %d: secure concatenation is not supported\n",
> + qid);
> + ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
> goto out_free_data;
> }
> /* Authentication required */
> @@ -549,7 +550,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
> if (ret) {
> dev_warn(ctrl->device,
> "qid %d: authentication setup failed\n", qid);
> - ret = NVME_SC_AUTH_REQUIRED;
> + ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
> } else {
> ret = nvme_auth_wait(ctrl, qid);
> if (ret)
>
> Is this what you had in mind?

Which, incidentally, is basically the patch I just posted.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-03-06 14:36:00

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] nvme-fc: fix blktests nvme/041

On Wed, Feb 21, 2024 at 11:04:13AM -0700, Keith Busch wrote:
> On Wed, Feb 21, 2024 at 02:23:59PM +0100, Daniel Wagner wrote:
> > As suggested by Keith I've merged the new flag for nvme-fabrics with the last
> > patch. I also added some information why nvme-fc is handling the initial connect
> > attempt differently as requested by Christoph.
> >
> > The new flag is called connect_async and the default is false/0. I've tested
> > with a patched nvme-cli/libnvme version and all looks good. When this series is
> > accepted I'll update nvme-cli/libnvme accordingly. Note, that even with an older
> > nvme-cli the blktests (especially nvme/041) will pass with a newer kernel.
> >
> > (nvme/048 still fails but this is a different problem. On my TODO list)
>
> Series looks good to me. Sounds like it's not fixing any regressions, so
> I think this goes to the 6.9 branch. I'll wait till end of week for any
> other feedback before pushing anything out.

The second patch in this series should be dropped and replaced with the
patches from Hannes' '[PATCH 0/4] nvme: fixes for authentication errors'

https://lore.kernel.org/linux-nvme/[email protected]/

Should I send out new version or do you take it from here?

2024-03-07 10:27:56

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] nvme-fabrics: introduce ref counting for nvmf_ctrl_options



On 21/02/2024 15:24, Daniel Wagner wrote:
> The FC transport is offloading the connect attempt to a workqueue. When
> the attempt fails the transport starts to clean up resources. It is
> possible for user space to trigger a crash because nvmf_ctrl_options are
> exposed to sysfs.
>
> This crash wasn't observed with blktests nvme/041 until now because the
> retry loop usually retried several times (e.g. with the default of
> 600s) and the test would trigger the cleanup itself. With the recent
> change not to retry with invalid credentials, the crash can be easily
> triggered.
>
> The simplest way to control the life time of nvmf_ctrl_options is by
> using ref counting.

Why do we need a refcount for an object that has the same exact lifetime
as the ctrl itself? It just feels like unneeded complication.

2024-03-07 10:36:05

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused



On 23/02/2024 13:58, Hannes Reinecke wrote:
> On 2/22/24 18:02, Daniel Wagner wrote:
>> On Thu, Feb 22, 2024 at 08:45:04AM +0100, Daniel Wagner wrote:
>>> On Thu, Feb 22, 2024 at 07:46:12AM +0100, Hannes Reinecke wrote:
>>>> On 2/21/24 17:37, Daniel Wagner wrote:
>>>>> On Wed, Feb 21, 2024 at 04:53:44PM +0100, Hannes Reinecke wrote:
>>>>> In this case yes, I've tested on top of this patch. This breaks the loop
>>>>> where the provided key is invalid or missing. The loop would happily
>>>>> retry until reaching the maximum number of retries.
>>>>
>>>> But that's to be expected, no?
>>>
>>> Why? If the key is wrong/missing it will likely still be wrong/missing
>>> on the next retry. So what's the point in retrying?
>>>
>>>> After all, that's _precisely_ what
>>>> NVME_SC_DNR is for;
>>>> if you shouldn't retry, that bit is set.
>>>> If it's not set, you should.
>>>
>>> Okay, in this case there is a bug in the auth code somewhere.
>>
>> With the change below nvme/041 also passes:
>>
>> modified   drivers/nvme/host/fabrics.c
>> @@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
>>           if (result & NVME_CONNECT_AUTHREQ_ASCR) {
>>               dev_warn(ctrl->device,
>>                    "qid 0: secure concatenation is not supported\n");
>> -            ret = NVME_SC_AUTH_REQUIRED;
>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>               goto out_free_data;
>>           }
>>           /* Authentication required */
>> @@ -475,7 +475,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
>>           if (ret) {
>>               dev_warn(ctrl->device,
>>                    "qid 0: authentication setup failed\n");
>> -            ret = NVME_SC_AUTH_REQUIRED;
>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>               goto out_free_data;
>>           }
>>           ret = nvme_auth_wait(ctrl, 0);
>> @@ -540,8 +540,9 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl,
>> u16 qid)
>>           /* Secure concatenation is not implemented */
>>           if (result & NVME_CONNECT_AUTHREQ_ASCR) {
>>               dev_warn(ctrl->device,
>> -                 "qid 0: secure concatenation is not supported\n");
>> -            ret = NVME_SC_AUTH_REQUIRED;
>> +                 "qid %d: secure concatenation is not supported\n",
>> +                 qid);
>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>               goto out_free_data;
>>           }
>>           /* Authentication required */
>> @@ -549,7 +550,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl,
>> u16 qid)
>>           if (ret) {
>>               dev_warn(ctrl->device,
>>                    "qid %d: authentication setup failed\n", qid);
>> -            ret = NVME_SC_AUTH_REQUIRED;
>> +            ret = NVME_SC_AUTH_REQUIRED | NVME_SC_DNR;
>>           } else {
>>               ret = nvme_auth_wait(ctrl, qid);
>>               if (ret)
>>
>> Is this what you had in mind?
>
> Which, incidentally, is basically the patch I just posted.

Reading this, the patchset from Hannes is now clearer.
Isn't the main issue that nvme-fc tries to periodically reconnect
after failing the first connect? This is exactly what the test expects
it to do, right?

2024-03-07 10:40:54

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] nvme-fc: wait for initial connect attempt to finish



On 21/02/2024 15:24, Daniel Wagner wrote:
> The TCP and RDMA transports do a synchronous connect, that is, the
> syscall returns with the final result. The operation either fails or
> succeeds. The FC transport offloads the connect attempt to a workqueue
> and thus it's an asynchronous operation.
>
> This async connect feature was introduced to mitigate problems with
> transient connect errors and the task to coordinate retries with
> userspace (nvme-cli).
Maybe it is a better idea to let userspace handle this? The whole async
connect (and reconnect) just seems to be causing more issues than it is
solving... I may have missed some conversations about this one...

2024-03-07 14:53:38

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] nvme-fc: do not retry when auth fails or connection is refused

On Thu, Mar 07, 2024 at 12:25:16PM +0200, Sagi Grimberg wrote:
> > Is this what you had in mind?
> >
> > Which, incidentally, is basically the patch I just posted.
>
> Reading this, the patchset from Hannes now is clearer.
> Isn't the main issue that nvme-fc tries to periodically reconnect
> after failing the first connect? This is exactly what the test expects
> it to do, right?

Yes, the test expects that the initial connect attempt fails. nvme-fc is
using one connect path and doesn't distinguish between the initial
connect attempt and a reconnect.

All fabric transports share the same problem when the connection has
been established and later a connection drop happens and a reconnect is
executed. The blktests nvme/048 test case extension I've posted tests
the reconnect attempt after the controller key has changed. In this
scenario, nvme-tcp and nvme-rdma will also do a reconnect attempt
although DNR is set.

2024-03-11 17:36:30

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

On Thu, Mar 07, 2024 at 12:27:43PM +0200, Sagi Grimberg wrote:
> Why do we need a refcount for an object that has the same exact lifetime
> as the ctrl itself? It just feels like unneeded complication.

My claim that the UAF is also possible with the current code is not
correct, or at least not easy to reproduce. I've re-tested a lot and I
couldn't reproduce it.

Though, the UAF is very simple to reproduce with the sync connect patch
applied (nvme-fc: wait for initial connect attempt to finish) together
with Hannes' patch (nvme: authentication error are always
non-retryable):

In this case, the initial connect fails and the resources are removed,
while we are waiting in

+ if (!opts->connect_async) {
+ enum nvme_ctrl_state state;
+
+ wait_for_completion(&ctrl->connect_completion);
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ nvme_fc_ctrl_put(ctrl);
+
+ if (state != NVME_CTRL_LIVE) {
+ /* Cleanup is handled by the connect state machine */
+ return ERR_PTR(-EIO);
+ }
+ }

This opens up the race window. While we are waiting here for the
completion, the ctrl entry in sysfs is still reachable. Unfortunately,
we also fire an uevent which starts another instance of nvme-cli. And
the new instance of nvme-cli iterates over sysfs and reads the already
freed options object.

run blktests nvme/041 at 2024-03-11 18:13:38
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association created
[8167] nvmet: ctrl 1 start keep-alive timer for 5 secs
[8167] nvmet: check nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[8167] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
[8167] nvmet: nvmet_setup_auth: using hash none key fb 28 d3 79 af 04 ba 36 95 3b e5 89 6c bf 42 90 4a dd dd 1b d4 e8 ba ce b2 7c 16 d4 01 7d 4f 20
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
nvme nvme0: qid 0: no key
nvme nvme0: qid 0: authentication setup failed
nvme nvme0: NVME-FC{0}: create_assoc failed, assoc_id f2139b60a42c0000 ret 16785
nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (16785)
nvme nvme0: NVME-FC{0}: reconnect failure
nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
==================================================================
BUG: KASAN: slab-use-after-free in nvme_class_uevent+0xb9/0x1a0 [nvme_core]
Read of size 8 at addr ffff888107229698 by task systemd-journal/578

CPU: 1 PID: 578 Comm: systemd-journal Tainted: G W 6.8.0-rc6+ #43 106200e85ab1e5c3399a68beb80cc63ca4823f3a
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x80
print_report+0x163/0x800
? __virt_addr_valid+0x2f3/0x340
? nvme_class_uevent+0xb9/0x1a0 [nvme_core a5a8fc3d48e3ec2a76ff6521d70aebe532cfd700]
kasan_report+0xd0/0x110
? nvme_class_uevent+0xb9/0x1a0 [nvme_core a5a8fc3d48e3ec2a76ff6521d70aebe532cfd700]
nvme_class_uevent+0xb9/0x1a0 [nvme_core a5a8fc3d48e3ec2a76ff6521d70aebe532cfd700]
dev_uevent+0x374/0x640
uevent_show+0x187/0x2a0
dev_attr_show+0x5f/0xb0
sysfs_kf_seq_show+0x2a8/0x3f0
? __cfi_dev_attr_show+0x10/0x10
seq_read_iter+0x3f1/0xc00
vfs_read+0x6cf/0x960
ksys_read+0xd7/0x1a0
do_syscall_64+0xb1/0x180
? do_syscall_64+0xc0/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f4297b0a3dc
Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 97 18 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 fd

RSP: 002b:00007ffd945ec430 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000564732197e60 RCX: 00007f4297b0a3dc
RDX: 0000000000001008 RSI: 0000564732197e60 RDI: 000000000000001a
RBP: 000000000000001a R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000001007
R13: 0000000000001008 R14: ffffffffffffffff R15: 0000000000000002
</TASK>

Allocated by task 31249 on cpu 0 at 5508.645525s:
kasan_save_track+0x2c/0x90
__kasan_kmalloc+0x89/0xa0
kmalloc_trace+0x1f3/0x3c0
nvmf_dev_write+0x15c/0x2990 [nvme_fabrics]
vfs_write+0x1cd/0xb60
ksys_write+0xd7/0x1a0
do_syscall_64+0xb1/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76

Freed by task 31249 on cpu 2 at 5508.686805s:
kasan_save_track+0x2c/0x90
kasan_save_free_info+0x4a/0x60
poison_slab_object+0x108/0x180
__kasan_slab_free+0x33/0x80
kfree+0x119/0x310
nvmf_dev_write+0x23e0/0x2990 [nvme_fabrics]
vfs_write+0x1cd/0xb60
ksys_write+0xd7/0x1a0
do_syscall_64+0xb1/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76

The buggy address belongs to the object at ffff888107229680
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 24 bytes inside of
freed 192-byte region [ffff888107229680, ffff888107229740)

The buggy address belongs to the physical page:
page:0000000070cf556f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x107228
head:0000000070cf556f order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000840 ffff888100042c00 ffffea0004363500 dead000000000004
raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888107229580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
ffff888107229600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888107229680: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888107229700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff888107229780: fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb fb
==================================================================

2024-03-11 19:29:09

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] nvme-fabrics: introduce ref counting for nvmf_ctrl_options

On 3/11/24 18:36, Daniel Wagner wrote:
> On Thu, Mar 07, 2024 at 12:27:43PM +0200, Sagi Grimberg wrote:
>> Why do we need a refcount for an object that has the same exact lifetime
>> as the ctrl itself? It just feels like unneeded complication.
>
> My claim that the UAF is also possible with the current code is not
> correct, or at least not easy to reproduce. I've re-tested a lot and I
> couldn't reproduce it.
>
> Though, the UAF is very simple to reproduce with the sync connect patch
> applied (nvme-fc: wait for initial connect attempt to finish) together
> with Hannes' patch (nvme: authentication error are always
> non-retryable):
>
> In this case, the initial connect fails and the resources are removed,
> while we are waiting in
>
> + if (!opts->connect_async) {
> + enum nvme_ctrl_state state;
> +
> + wait_for_completion(&ctrl->connect_completion);
> + state = nvme_ctrl_state(&ctrl->ctrl);
> + nvme_fc_ctrl_put(ctrl);
> +
> + if (state != NVME_CTRL_LIVE) {
> + /* Cleanup is handled by the connect state machine */
> + return ERR_PTR(-EIO);
> + }
> + }
>
> This opens up the race window. While we are waiting here for the
> completion, the ctrl entry in sysfs is still reachable. Unfortunately,
> we also fire an uevent which starts another instance of nvme-cli. And
> the new instance of nvme-cli iterates over sysfs and reads the already
> freed options object.
>
Curiously enough, I had been digging into better error reporting for
nvme-fabrics. And the one thing I came up with is to make the controller
_options_ the private pointer of the seq_file.
With that we can allocate and initialize the options during open(),
and then have write() do the parsing and call create_ctrl() as usual.
But read() would then always have access to the options structure, and
we can use this structure to pass any errors. E.g. parsing errors could
be reported via an 'err_mask' field and so on.

That would allow us to report errors back to nvme-cli, and,
incidentally, also require reference counting.
Two stones with a bird and all that.

Patch is in testing, and I'll be posting it once I get confirmation.


Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich