2019-07-25 18:12:47

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 00/16] nvmet: add target passthru commands support

Hi,

Chaitainya has asked us to take on these patches as we have an
interest in getting them into upstream. To that end, we've done
a large amount of testing, bug fixes and cleanup.

Passthru support for nvmet allows users to export an entire
NVMe controller through NVMe-oF. When exported in this way (as opposed
to exporting each namespace as a block device), all the NVMe commands
are passed to the given controller unmodified, including most admin
commands and Vendor Unique Commands (VUCs). A passthru target will
expose all namespaces for a given device to the remote host.

There are three major non-bugfix changes that we've done to the series:

1) Instead of using a seperate special passthru subsystem in
configfs simply add a passthru directory that's analogous to
the existing namespace directories. The directories have
very similar attributes to namespaces but are mutually exclusive.
If a user enables a namespaces, they can't then enable
passthru controller and vice versa. This simplifies the code
required to implement passthru configfs and IMO creates a much
clearer and uniform interface.

2) Instead of taking a bare controller name (ie. "nvme1"), take a
full device path to the controller's char device. This is more
consistent with the regular namespaces which take a path and
also allows users to make use of udev rules and symlinks to
manage their controllers instead of the potentially unstable
device names.

3) Implement block accounting for the passthru devices. This is so
the target OS can still track device usage using /proc/diskstats.

Besides these three changes, we've also found a large number of bugs
and crashes and did a bunch of testing with KASAN, lockdep and kmemleak.
A more complete list of changes is given below.

Additionally, we've written some new blktests to test the passthru
code. A branch is available here[1] and can be submitted once these
patches are upstream.

These patches are based off of v5.3-rc1 and a git branch is available
at [2].

Thanks,

Logan

[1] https://github.com/Eideticom/blktests nvmet_passthru
[2] https://github.com/sbates130272/linux-p2pmem/ nvmet_passthru_v6

--

v6 Changes (done by Logan):
1. Rebased onto v5.3-rc1
2. Rework configfs interface to simply be a passthru directory
within the existing subsystem. The directory is similar to
and consistent with a namespace directory.
3. Have the configfs take a path instead of a bare controller name
4. Renamed the main passthru file to io-cmd-passthru.c for consistency
with the file and block-dev methods.
5. Cleaned up all the CONFIG_NVME_TARGET_PASSTHRU usage to remove
all the inline #ifdefs
6. Restructured nvmet_passthru_make_request() a bit for clearer code
7. Moved nvme_find_get_ns() call into nvmet_passthru_execute_cmd()
seeing calling it in nvmet_req_init() causes a lockdep warning
due to nvme_find_get_ns() being able to sleep.
8. Added a check in nvmet_passthru_execute_cmd() to ensure we don't
violate queue_max_segments or queue_max_hw_sectors and overrode
mdts to ensure hosts don't intentionally submit commands
that will exceed these limits.
9. Reworked the code which ensures there's only one subsystem per
passthru controller to use an xarray instead of a list as this is
simpler and more easily fixed some bugs triggered by disabling
subsystems that weren't enabled.
10. Removed the overide of the target cntlid with the passthru cntlid;
this seemed like a really bad idea especially in the presence of
mixed systems as you could end up with two ctrlrs with the same
cntlid. For now, commands that depend on cntlid are black listed.
11. Implement block accounting for passthru so the target can track
usage using /proc/diskstats
12. A number of other minor bug fixes and cleanups

v5 Changes (not sent to list, from Chaitanya):
1. Added workqueue for admin commands.
2. Added kconfig option for the pass-thru feature.
3. Restructure the parsing code according to your suggestion,
call nvmet_xxx_parse_cmd() from nvmet_passthru_parse_cmd().
4. Use pass-thru instead of pt.
5. Several cleanups and add comments at the appropriate locations.
6. Minimize the code for checking pass-thru ns across all the subsystems.
7. Removed the delays in the ns related admin commands since I was
not able to reproduce the previous bug.

v4 Changes:
1. Add request polling interface to the block layer.
2. Use request polling interface in the NVMEoF target passthru code
path.
3. Add checks suggested by Sagi for creating one target ctrl per
passthru ctrl.
4. Don't enable the namespace if it belongs to the configured passthru
ctrl.
5. Adjust the code latest kernel.

v3 Changes:
1. Split the addition of passthru command handlers and integration
into two different patches since we add guards to create one target
controller per passthru controller. This way it will be easier to
review the code.
2. Adjust the code for 4.18.

v2 Changes:
1. Update the new nvme core controller find API naming and
changed the string comparison of the ctrl.
2. Get rid of the newly added #defines for target ctrl values.
3. Use the newly added structure members in the same patch where
they are used. Aggregate the passthru command handling support
and integration with nvmet-core into one patch.
4. Introduce global NVMe Target subsystem list for connected and
not connected subsystems on the target side.
5. Add check when configuring the target ns and target
passthru ctrl to allow only one target controller to be created
for one passthru subsystem.
6. Use the passthru ctrl cntlid when creating the
target controller.

Chaitanya Kulkarni (6):
nvme-core: export existing ctrl and ns interfaces
nvmet: add return value to nvmet_add_async_event()
nvmet-passthru: update KConfig with config passthru option
nvmet-passthru: add passthru code to process commands
nvmet-core: allow one host per passthru-ctrl
nvmet-core: don't check the data len for pt-ctrl

Logan Gunthorpe (10):
chardev: factor out cdev_lookup() helper
chardev: introduce cdev_get_by_path()
chardev: export cdev_put()
nvme-core: introduce nvme_get_by_path()
nvmet: make nvmet_copy_ns_identifier() non-static
nvmet-passthru: add enable/disable helpers
nvmet-configfs: introduce passthru configfs interface
block: don't check blk_rq_is_passthrough() in blk_do_io_stat()
block: call blk_account_io_start() in blk_execute_rq_nowait()
nvmet-passthru: support block accounting

block/blk-exec.c | 2 +
block/blk-mq.c | 2 +-
block/blk.h | 5 +-
drivers/nvme/host/core.c | 40 +-
drivers/nvme/host/nvme.h | 9 +
drivers/nvme/target/Kconfig | 10 +
drivers/nvme/target/Makefile | 1 +
drivers/nvme/target/admin-cmd.c | 4 +-
drivers/nvme/target/configfs.c | 99 ++++
drivers/nvme/target/core.c | 41 +-
drivers/nvme/target/io-cmd-passthru.c | 681 ++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 71 ++-
fs/char_dev.c | 61 ++-
include/linux/cdev.h | 1 +
14 files changed, 1003 insertions(+), 24 deletions(-)
create mode 100644 drivers/nvme/target/io-cmd-passthru.c

--
2.20.1


2019-07-25 18:12:48

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 11/16] nvmet-core: allow one host per passthru-ctrl

From: Chaitanya Kulkarni <[email protected]>

This patch rejects any new connection to the passthru-ctrl if this
controller is already connected to host. At the time of allocating the
controller we check if the subsys associated with the passthru ctrl is
already connected to the host.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
[[email protected]:
* drop the overide of the target cntlid with the passthru cntlid;
this seemed like a really bad idea especially in the presence of
mixed systems as you could end up with two ctrlrs with the same
cntlid
* push the check to ensure only one ctrlr per subsys into
functions in passthru-cmd.c to avoid excess inline #ifdefs
]
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/core.c | 6 +++++
drivers/nvme/target/io-cmd-passthru.c | 32 +++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 10 +++++++++
3 files changed, 48 insertions(+)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 2e75968af7f4..9e92486e2ee9 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -13,6 +13,7 @@
#define CREATE_TRACE_POINTS
#include "trace.h"

+#include "../host/nvme.h"
#include "nvmet.h"

struct workqueue_struct *buffered_io_wq;
@@ -1278,6 +1279,10 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
if (!ctrl->sqs)
goto out_free_cqs;

+ ret = nvmet_passthru_alloc_ctrl(subsys);
+ if (ret)
+ goto out_free_sqs;
+
ret = ida_simple_get(&cntlid_ida,
NVME_CNTLID_MIN, NVME_CNTLID_MAX,
GFP_KERNEL);
@@ -1341,6 +1346,7 @@ static void nvmet_ctrl_free(struct kref *ref)
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fatal_err_work);

+ nvmet_passthru_ctrl_free(subsys);
ida_simple_remove(&cntlid_ida, ctrl->cntlid);

kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd-passthru.c b/drivers/nvme/target/io-cmd-passthru.c
index 9ddcdb8415fc..aefdd534cb4a 100644
--- a/drivers/nvme/target/io-cmd-passthru.c
+++ b/drivers/nvme/target/io-cmd-passthru.c
@@ -104,6 +104,38 @@ void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys)
mutex_unlock(&subsys->lock);
}

+int nvmet_passthru_alloc_ctrl(struct nvmet_subsys *subsys)
+{
+ /*
+ * Check here if this subsystem is already connected to the passthru
+ * ctrl. We allow only one target ctrl for one passthru subsystem.
+ */
+
+ mutex_lock(&subsys->lock);
+
+ if (!subsys->passthru_ctrl)
+ goto out;
+
+ if (subsys->passthru_connected) {
+ mutex_unlock(&subsys->lock);
+ return -ENODEV;
+ }
+
+ subsys->passthru_connected = true;
+
+out:
+ mutex_unlock(&subsys->lock);
+
+ return 0;
+}
+
+void nvmet_passthru_ctrl_free(struct nvmet_subsys *subsys)
+{
+ mutex_lock(&subsys->lock);
+ subsys->passthru_connected = false;
+ mutex_unlock(&subsys->lock);
+}
+
static void nvmet_passthru_req_complete(struct nvmet_req *req,
struct request *rq, u16 status)
{
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index aff4db03269d..004949b6b666 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -231,6 +231,7 @@ struct nvmet_subsys {
#ifdef CONFIG_NVME_TARGET_PASSTHRU
struct nvme_ctrl *passthru_ctrl;
char *passthru_ctrl_path;
+ bool passthru_connected;
#endif /* CONFIG_NVME_TARGET_PASSTHRU */
};

@@ -513,6 +514,8 @@ void nvmet_passthru_destroy(void);
void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys);
int nvmet_passthru_ctrl_enable(struct nvmet_subsys *subsys);
void nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys);
+int nvmet_passthru_alloc_ctrl(struct nvmet_subsys *subsys);
+void nvmet_passthru_ctrl_free(struct nvmet_subsys *subsys);
u16 nvmet_parse_passthru_cmd(struct nvmet_req *req);

static inline
@@ -536,6 +539,13 @@ static inline void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys)
static inline void nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys)
{
}
+static inline int nvmet_passthru_alloc_ctrl(struct nvmet_subsys *subsys)
+{
+ return 0;
+}
+static inline void nvmet_passthru_ctrl_free(struct nvmet_subsys *subsys)
+{
+}
static inline u16 nvmet_parse_passthru_cmd(struct nvmet_req *req)
{
return 0;
--
2.20.1


2019-07-25 18:12:51

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 03/16] chardev: export cdev_put()

In order to use the new cdev_get_by_path() helper in a module,
cdev_put() must be exported as well.

Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/char_dev.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/char_dev.c b/fs/char_dev.c
index 25037d642ff8..5ab5d7da29bf 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -724,5 +724,6 @@ EXPORT_SYMBOL(cdev_set_parent);
EXPORT_SYMBOL(cdev_device_add);
EXPORT_SYMBOL(cdev_device_del);
EXPORT_SYMBOL(cdev_get_by_path);
+EXPORT_SYMBOL(cdev_put);
EXPORT_SYMBOL(__register_chrdev);
EXPORT_SYMBOL(__unregister_chrdev);
--
2.20.1


2019-07-25 18:12:52

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 12/16] nvmet-core: don't check the data len for pt-ctrl

From: Chaitanya Kulkarni <[email protected]>

Right now, data_len is calculated before the transfer len after we
parse the command, With passthru interface we allow VUCs (Vendor-Unique
Commands). In order to make the code simple and compact, instead of
assigning the data len or each VUC in the command parse function
just use the transfer len as it is. This may result in error if expected
data_len != transfer_len.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
[[email protected]:
* added definition of VUC to the commit message and comment
* use nvmet_req_passthru_ctrl() helper seeing we can't dereference
subsys->passthru_ctrl if CONFIG_NVME_TARGET_PASSTHRU is not set]
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/core.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 9e92486e2ee9..77660dfc6c8f 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -942,7 +942,16 @@ EXPORT_SYMBOL_GPL(nvmet_req_uninit);

void nvmet_req_execute(struct nvmet_req *req)
{
- if (unlikely(req->data_len != req->transfer_len)) {
+ /*
+ * data_len is calculated before the transfer len after we parse
+ * the command, With passthru interface we allow VUC (Vendor-Unique
+ * Commands)'s. In order to make the code simple and compact,
+ * instead of assinging the dala len for each VUC in the command
+ * parse function just use the transfer len as it is. This may
+ * result in error if expected data_len != transfer_len.
+ */
+ if (!(req->sq->ctrl && nvmet_req_passthru_ctrl(req)) &&
+ unlikely(req->data_len != req->transfer_len)) {
req->error_loc = offsetof(struct nvme_common_command, dptr);
nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR);
} else
--
2.20.1


2019-07-25 18:12:56

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 13/16] nvmet-configfs: introduce passthru configfs interface

When CONFIG_NVME_TARGET_PASSTHRU as 'passthru' directory will
be added to each subsystem. The directory is similar to a namespace
and has two attributes: device_path and enable. The user must set the
path to the nvme controller's char device and write '1' to enable the
subsystem to use passthru.

Any given subsystem is prevented from enabling both a regular namespace
and the passthru device. If one is enabled, enabling the other will
produce an error.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/configfs.c | 99 ++++++++++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 1 +
2 files changed, 100 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a45bd3b..b15d64c19f58 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -615,6 +615,103 @@ static const struct config_item_type nvmet_namespaces_type = {
.ct_owner = THIS_MODULE,
};

+#ifdef CONFIG_NVME_TARGET_PASSTHRU
+
+static ssize_t nvmet_passthru_device_path_show(struct config_item *item,
+ char *page)
+{
+ struct nvmet_subsys *subsys = to_subsys(item->ci_parent);
+
+ return snprintf(page, PAGE_SIZE, "%s\n", subsys->passthru_ctrl_path);
+}
+
+static ssize_t nvmet_passthru_device_path_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_subsys *subsys = to_subsys(item->ci_parent);
+ int ret = -ENOMEM;
+ size_t len;
+
+ mutex_lock(&subsys->lock);
+
+ ret = -EBUSY;
+ if (subsys->passthru_ctrl)
+ goto out_unlock;
+
+ ret = -EINVAL;
+ len = strcspn(page, "\n");
+ if (!len)
+ goto out_unlock;
+
+ kfree(subsys->passthru_ctrl_path);
+ ret = -ENOMEM;
+ subsys->passthru_ctrl_path = kstrndup(page, len, GFP_KERNEL);
+ if (!subsys->passthru_ctrl_path)
+ goto out_unlock;
+
+ mutex_unlock(&subsys->lock);
+
+ return count;
+out_unlock:
+ mutex_unlock(&subsys->lock);
+ return ret;
+}
+CONFIGFS_ATTR(nvmet_passthru_, device_path);
+
+static ssize_t nvmet_passthru_enable_show(struct config_item *item,
+ char *page)
+{
+ struct nvmet_subsys *subsys = to_subsys(item->ci_parent);
+
+ return sprintf(page, "%d\n", subsys->passthru_ctrl ? 1 : 0);
+}
+
+static ssize_t nvmet_passthru_enable_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_subsys *subsys = to_subsys(item->ci_parent);
+ bool enable;
+ int ret = 0;
+
+ if (strtobool(page, &enable))
+ return -EINVAL;
+
+ if (enable)
+ ret = nvmet_passthru_ctrl_enable(subsys);
+ else
+ nvmet_passthru_ctrl_disable(subsys);
+
+ return ret ? ret : count;
+}
+CONFIGFS_ATTR(nvmet_passthru_, enable);
+
+static struct configfs_attribute *nvmet_passthru_attrs[] = {
+ &nvmet_passthru_attr_device_path,
+ &nvmet_passthru_attr_enable,
+ NULL,
+};
+
+static const struct config_item_type nvmet_passthru_type = {
+ .ct_attrs = nvmet_passthru_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static void nvmet_add_passthru_group(struct nvmet_subsys *subsys)
+{
+ config_group_init_type_name(&subsys->passthru_group,
+ "passthru", &nvmet_passthru_type);
+ configfs_add_default_group(&subsys->passthru_group,
+ &subsys->group);
+}
+
+#else /* CONFIG_NVME_TARGET_PASSTHRU */
+
+static void nvmet_add_passthru_group(struct nvmet_subsys *subsys)
+{
+}
+
+#endif /* CONFIG_NVME_TARGET_PASSTHRU */
+
static int nvmet_port_subsys_allow_link(struct config_item *parent,
struct config_item *target)
{
@@ -915,6 +1012,8 @@ static struct config_group *nvmet_subsys_make(struct config_group *group,
configfs_add_default_group(&subsys->allowed_hosts_group,
&subsys->group);

+ nvmet_add_passthru_group(subsys);
+
return &subsys->group;
}

diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 004949b6b666..fdcfadb7eb61 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -232,6 +232,7 @@ struct nvmet_subsys {
struct nvme_ctrl *passthru_ctrl;
char *passthru_ctrl_path;
bool passthru_connected;
+ struct config_group passthru_group;
#endif /* CONFIG_NVME_TARGET_PASSTHRU */
};

--
2.20.1


2019-07-25 18:12:58

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 14/16] block: don't check blk_rq_is_passthrough() in blk_do_io_stat()

Instead of checking blk_rq_is_passthruough() for every call to
blk_do_io_stat(), don't set RQF_IO_STAT for passthrough requests.
This should be equivalent, and opens the possibility of passthrough
requests specifically requesting statistics tracking.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-mq.c | 2 +-
block/blk.h | 5 ++---
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b038ec680e84..b8d41d6824f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->cmd_flags = op;
if (data->flags & BLK_MQ_REQ_PREEMPT)
rq->rq_flags |= RQF_PREEMPT;
- if (blk_queue_io_stat(data->q))
+ if (blk_queue_io_stat(data->q) && !blk_rq_is_passthrough(rq))
rq->rq_flags |= RQF_IO_STAT;
INIT_LIST_HEAD(&rq->queuelist);
INIT_HLIST_NODE(&rq->hash);
diff --git a/block/blk.h b/block/blk.h
index de6b2e146d6e..554efa769bfe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -234,13 +234,12 @@ int blk_dev_init(void);
*
* a) it's attached to a gendisk, and
* b) the queue had IO stats enabled when this request was started, and
- * c) it's a file system request
+ * c) it's a file system request (RQF_IO_STAT will not be set otherwise)
*/
static inline bool blk_do_io_stat(struct request *rq)
{
return rq->rq_disk &&
- (rq->rq_flags & RQF_IO_STAT) &&
- !blk_rq_is_passthrough(rq);
+ (rq->rq_flags & RQF_IO_STAT);
}

static inline void req_set_nomerge(struct request_queue *q, struct request *req)
--
2.20.1


2019-07-25 18:13:03

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 10/16] nvmet-passthru: add enable/disable helpers

This patch adds helper functions which are used in the NVMeOF configfs
when the user is configuring the passthru subsystem. Here we ensure
that only one subsys is assigned to each nvme_ctrl by using an xarray
on the cntlid.

[[email protected]: this patch is very roughly based
on a similar one by Chaitanya]
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/core.c | 8 +++
drivers/nvme/target/io-cmd-passthru.c | 77 +++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 10 ++++
3 files changed, 95 insertions(+)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 50c01b2da568..2e75968af7f4 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -519,6 +519,12 @@ int nvmet_ns_enable(struct nvmet_ns *ns)

mutex_lock(&subsys->lock);
ret = 0;
+
+ if (nvmet_passthru_ctrl(subsys)) {
+ pr_info("cannot enable both passthru and regular namespaces for a single subsystem");
+ goto out_unlock;
+ }
+
if (ns->enabled)
goto out_unlock;

@@ -1439,6 +1445,8 @@ static void nvmet_subsys_free(struct kref *ref)

WARN_ON_ONCE(!list_empty(&subsys->namespaces));

+ nvmet_passthru_subsys_free(subsys);
+
kfree(subsys->subsysnqn);
kfree(subsys);
}
diff --git a/drivers/nvme/target/io-cmd-passthru.c b/drivers/nvme/target/io-cmd-passthru.c
index 920b102ec13b..9ddcdb8415fc 100644
--- a/drivers/nvme/target/io-cmd-passthru.c
+++ b/drivers/nvme/target/io-cmd-passthru.c
@@ -11,6 +11,11 @@
#include "../host/nvme.h"
#include "nvmet.h"

+/*
+ * xarray to maintain one passthru subsystem per nvme controller.
+ */
+static DEFINE_XARRAY(passthru_subsystems);
+
static struct workqueue_struct *passthru_wq;

int nvmet_passthru_init(void)
@@ -27,6 +32,78 @@ void nvmet_passthru_destroy(void)
destroy_workqueue(passthru_wq);
}

+int nvmet_passthru_ctrl_enable(struct nvmet_subsys *subsys)
+{
+ struct nvme_ctrl *ctrl;
+ int ret = -EINVAL;
+ void *old;
+
+ mutex_lock(&subsys->lock);
+ if (!subsys->passthru_ctrl_path)
+ goto out_unlock;
+ if (subsys->passthru_ctrl)
+ goto out_unlock;
+
+ if (subsys->nr_namespaces) {
+ pr_info("cannot enable both passthru and regular namespaces for a single subsystem");
+ goto out_unlock;
+ }
+
+ ctrl = nvme_ctrl_get_by_path(subsys->passthru_ctrl_path);
+ if (IS_ERR(ctrl)) {
+ ret = PTR_ERR(ctrl);
+ pr_err("failed to open nvme controller %s\n",
+ subsys->passthru_ctrl_path);
+
+ goto out_unlock;
+ }
+
+ old = xa_cmpxchg(&passthru_subsystems, ctrl->cntlid, NULL,
+ subsys, GFP_KERNEL);
+ if (xa_is_err(old)) {
+ ret = xa_err(old);
+ goto out_put_ctrl;
+ }
+
+ if (old)
+ goto out_put_ctrl;
+
+ subsys->passthru_ctrl = ctrl;
+ ret = 0;
+
+ goto out_unlock;
+
+out_put_ctrl:
+ nvme_put_ctrl(ctrl);
+out_unlock:
+ mutex_unlock(&subsys->lock);
+ return ret;
+}
+
+static void __nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys)
+{
+ if (subsys->passthru_ctrl) {
+ xa_erase(&passthru_subsystems, subsys->passthru_ctrl->cntlid);
+ nvme_put_ctrl(subsys->passthru_ctrl);
+ }
+ subsys->passthru_ctrl = NULL;
+}
+
+void nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys)
+{
+ mutex_lock(&subsys->lock);
+ __nvmet_passthru_ctrl_disable(subsys);
+ mutex_unlock(&subsys->lock);
+}
+
+void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys)
+{
+ mutex_lock(&subsys->lock);
+ __nvmet_passthru_ctrl_disable(subsys);
+ kfree(subsys->passthru_ctrl_path);
+ mutex_unlock(&subsys->lock);
+}
+
static void nvmet_passthru_req_complete(struct nvmet_req *req,
struct request *rq, u16 status)
{
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index bd11114ebbb9..aff4db03269d 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -230,6 +230,7 @@ struct nvmet_subsys {

#ifdef CONFIG_NVME_TARGET_PASSTHRU
struct nvme_ctrl *passthru_ctrl;
+ char *passthru_ctrl_path;
#endif /* CONFIG_NVME_TARGET_PASSTHRU */
};

@@ -509,6 +510,9 @@ static inline u32 nvmet_rw_len(struct nvmet_req *req)

int nvmet_passthru_init(void);
void nvmet_passthru_destroy(void);
+void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys);
+int nvmet_passthru_ctrl_enable(struct nvmet_subsys *subsys);
+void nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys);
u16 nvmet_parse_passthru_cmd(struct nvmet_req *req);

static inline
@@ -526,6 +530,12 @@ static inline int nvmet_passthru_init(void)
static inline void nvmet_passthru_destroy(void)
{
}
+static inline void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys)
+{
+}
+static inline void nvmet_passthru_ctrl_disable(struct nvmet_subsys *subsys)
+{
+}
static inline u16 nvmet_parse_passthru_cmd(struct nvmet_req *req)
{
return 0;
--
2.20.1


2019-07-25 18:13:07

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()

nvme_get_by_path() is analagous to blkdev_get_by_path() except it
gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).

The purpose of this function is to support NVMe-OF target passthru.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 23 +++++++++++++++++++++++
drivers/nvme/host/nvme.h | 2 ++
2 files changed, 25 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 00399c3536fa..61e7b016ec36 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2816,6 +2816,29 @@ static const struct file_operations nvme_dev_fops = {
.compat_ioctl = nvme_dev_ioctl,
};

+struct nvme_ctrl *nvme_ctrl_get_by_path(const char *path)
+{
+ struct nvme_ctrl *ctrl;
+ struct cdev *cdev;
+
+ cdev = cdev_get_by_path(path);
+ if (IS_ERR(cdev))
+ return ERR_CAST(cdev);
+
+ if (cdev->ops != &nvme_dev_fops) {
+ ctrl = ERR_PTR(-EINVAL);
+ goto out_cdev_put;
+ }
+
+ ctrl = container_of(cdev, struct nvme_ctrl, cdev);
+ nvme_get_ctrl(ctrl);
+
+out_cdev_put:
+ cdev_put(cdev);
+ return ctrl;
+}
+EXPORT_SYMBOL_GPL(nvme_ctrl_get_by_path);
+
static ssize_t nvme_sysfs_reset(struct device *dev,
struct device_attribute *attr, const char *buf,
size_t count)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index da3d130773c4..a19bd0091de6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -484,6 +484,8 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
extern const struct attribute_group *nvme_ns_id_attr_groups[];
extern const struct block_device_operations nvme_ns_head_ops;

+struct nvme_ctrl *nvme_ctrl_get_by_path(const char *path);
+
#ifdef CONFIG_NVME_MULTIPATH
bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl);
void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
--
2.20.1


2019-07-25 18:13:07

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 09/16] nvmet-passthru: add passthru code to process commands

From: Chaitanya Kulkarni <[email protected]>

This patch adds passthru command handling capability for the NVMeOF
target and exports passthru APIs which are used to integrate passthru
code with nvmet-core. We add passthru ns member to the target request
to hold the ns reference for respective commands.

The new file io-cmd-passthru.c handles passthru cmd parsing and execution.
In the passthru mode, we create a block layer request from the nvmet
request and map the data on to the block layer request. For handling the
side effects of the passthru admin commands we add two functions similar
to the nvme_passthru[start|end]() functions present in the nvme-core. We
explicitly blacklist the commands at the time of parsing, which allows
us to route the fabric commands through default code path.

We introduce new passthru workqueue similar to the one we have for the
file backend for NVMeOF target to execute the NVMe Admin passthru
commands.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
[[email protected]:
* renamed passthru-cmd.c to io-cmd-passthru.c for consistency
* squashed "update target makefile for passthru"
* squashed "integrate passthru request processing"
* added appropriate CONFIG_NVME_TARGET_PASSTHRU #ifdefs
* pushed passthru_wq into passthrtu.c and introduced
nvmet_passthru_init() and nvmet_passthru_destroy() to avoid
inline #ifdef mess
* renamed nvmet_passthru_ctrl() to nvmet_req_passthru_ctrl()
and provided nvmet_passthr_ctrl() to get the ctrl from a subsys
* fixed failure path in nvmet_passthru_execute_cmd() to ensure
we always complete the request (with an error when appropriate)
* restructered out nvmet_passthru_make_request() and
nvmet_passthru_execute_cmd() to create nvmet_passthru_map_sg()
which makes the code simpler and more readable.
* move call to nvme_find_get_ns() into nvmet_passthru_execute_cmd()
to prevent a lockdep error. nvme_find_get_ns() takes a
lock and can sleep but nvme_init_req() is called while
hctx_lock() is held (in the loop transport) and therefore
should not sleep.
* added check in nvmet_passthru_execute_cmd() to ensure we don't
violate queue_max_segments or queue_max_hw_sectors.
* added nvmet_passthru_set_mdts() to prevent requests that
exceed max_segments
* dropped le16_to_cpu() conversion in nvmet_passthru_req_done() as
it's currently already done in nvme_end_request()
* unabbreviated 'VUC' in a comment as it's not a commonly known
acronym
* removed unnecessary inline tags on static functions
* minor edits to commit message
]
Signed-off-by: Logan Gunthorpe <[email protected]>

squash! nvmet-passthru: add passthru code to process commands
---
drivers/nvme/target/Makefile | 1 +
drivers/nvme/target/core.c | 11 +-
drivers/nvme/target/io-cmd-passthru.c | 569 ++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 46 +++
4 files changed, 626 insertions(+), 1 deletion(-)
create mode 100644 drivers/nvme/target/io-cmd-passthru.c

diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 2b33836f3d3e..bf57799fde63 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o

nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \
discovery.o io-cmd-file.o io-cmd-bdev.o
+nvmet-$(CONFIG_NVME_TARGET_PASSTHRU) += io-cmd-passthru.o
nvme-loop-y += loop.o
nvmet-rdma-y += rdma.o
nvmet-fc-y += fc.o
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index f7f25bdc4763..50c01b2da568 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -895,6 +895,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
if (unlikely(!req->sq->ctrl))
/* will return an error for any Non-connect command: */
status = nvmet_parse_connect_cmd(req);
+ else if (nvmet_req_passthru_ctrl(req))
+ status = nvmet_parse_passthru_cmd(req);
else if (likely(req->sq->qid != 0))
status = nvmet_parse_io_cmd(req);
else if (nvme_is_fabrics(req->cmd))
@@ -1462,11 +1464,15 @@ static int __init nvmet_init(void)

nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1;

+ error = nvmet_passthru_init();
+ if (error)
+ goto out;
+
buffered_io_wq = alloc_workqueue("nvmet-buffered-io-wq",
WQ_MEM_RECLAIM, 0);
if (!buffered_io_wq) {
error = -ENOMEM;
- goto out;
+ goto out_passthru_destroy;
}

error = nvmet_init_discovery();
@@ -1482,6 +1488,8 @@ static int __init nvmet_init(void)
nvmet_exit_discovery();
out_free_work_queue:
destroy_workqueue(buffered_io_wq);
+out_passthru_destroy:
+ nvmet_passthru_destroy();
out:
return error;
}
@@ -1492,6 +1500,7 @@ static void __exit nvmet_exit(void)
nvmet_exit_discovery();
ida_destroy(&cntlid_ida);
destroy_workqueue(buffered_io_wq);
+ nvmet_passthru_destroy();

BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024);
BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024);
diff --git a/drivers/nvme/target/io-cmd-passthru.c b/drivers/nvme/target/io-cmd-passthru.c
new file mode 100644
index 000000000000..920b102ec13b
--- /dev/null
+++ b/drivers/nvme/target/io-cmd-passthru.c
@@ -0,0 +1,569 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe Over Fabrics Target Passthrough command implementation.
+ *
+ * Copyright (c) 2017-2018 Western Digital Corporation or its
+ * affiliates.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+
+#include "../host/nvme.h"
+#include "nvmet.h"
+
+static struct workqueue_struct *passthru_wq;
+
+int nvmet_passthru_init(void)
+{
+ passthru_wq = alloc_workqueue("nvmet-passthru-wq", WQ_MEM_RECLAIM, 0);
+ if (!passthru_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void nvmet_passthru_destroy(void)
+{
+ destroy_workqueue(passthru_wq);
+}
+
+static void nvmet_passthru_req_complete(struct nvmet_req *req,
+ struct request *rq, u16 status)
+{
+ nvmet_req_complete(req, status);
+
+ if (rq)
+ blk_put_request(rq);
+}
+
+static void nvmet_passthru_req_done(struct request *rq,
+ blk_status_t blk_status)
+{
+ struct nvmet_req *req = rq->end_io_data;
+ u16 status = nvme_req(rq)->status;
+
+ req->cqe->result.u32 = nvme_req(rq)->result.u32;
+
+ nvmet_passthru_req_complete(req, rq, status);
+}
+
+static u16 nvmet_passthru_override_format_nvm(struct nvmet_req *req)
+{
+ int lbaf = le32_to_cpu(req->cmd->format.cdw10) & 0x0000000F;
+ int nsid = le32_to_cpu(req->cmd->format.nsid);
+ u16 status = NVME_SC_SUCCESS;
+ struct nvme_id_ns *id;
+
+ id = nvme_identify_ns(nvmet_req_passthru_ctrl(req), nsid);
+ if (!id)
+ return NVME_SC_INTERNAL;
+ /*
+ * XXX: Please update this code once NVMeOF target starts supporting
+ * metadata. We don't support ns lba format with metadata over fabrics
+ * right now, so report an error if format nvm cmd tries to format
+ * a namespace with the LBA format which has metadata.
+ */
+ if (id->lbaf[lbaf].ms)
+ status = NVME_SC_INVALID_NS;
+
+ kfree(id);
+ return status;
+}
+
+static void nvmet_passthru_set_mdts(struct nvmet_ctrl *ctrl,
+ struct nvme_id_ctrl *id)
+{
+ struct nvme_ctrl *pctrl = ctrl->subsys->passthru_ctrl;
+ u32 max_hw_sectors;
+ int page_shift;
+
+ /*
+ * The passthru NVMe driver may have a limit on the number
+ * of segments which depends on the host's memory fragementation.
+ * To solve this, ensure mdts is limitted to the pages equal to
+ * the number of segments.
+ */
+
+ max_hw_sectors = min_not_zero(pctrl->max_segments << (PAGE_SHIFT - 9),
+ pctrl->max_hw_sectors);
+
+ page_shift = NVME_CAP_MPSMIN(ctrl->cap) + 12;
+
+ id->mdts = ilog2(max_hw_sectors) + 9 - page_shift;
+}
+
+static u16 nvmet_passthru_override_id_ctrl(struct nvmet_req *req)
+{
+ struct nvmet_ctrl *ctrl = req->sq->ctrl;
+ u16 status = NVME_SC_SUCCESS;
+ struct nvme_id_ctrl *id;
+
+ id = kzalloc(sizeof(*id), GFP_KERNEL);
+ if (!id) {
+ status = NVME_SC_INTERNAL;
+ goto out;
+ }
+ status = nvmet_copy_from_sgl(req, 0, id, sizeof(struct nvme_id_ctrl));
+ if (status)
+ goto out_free;
+
+ id->cntlid = cpu_to_le16(ctrl->cntlid);
+ id->ver = cpu_to_le32(ctrl->subsys->ver);
+
+ nvmet_passthru_set_mdts(ctrl, id);
+
+ id->acl = 3;
+ /*
+ * We export aerl limit for the fabrics controller, update this when
+ * passthru based aerl support is added.
+ */
+ id->aerl = NVMET_ASYNC_EVENTS - 1;
+
+ /* emulate kas as most of the PCIe ctrl don't have a support for kas */
+ id->kas = cpu_to_le16(NVMET_KAS);
+
+ /* don't support host memory buffer */
+ id->hmpre = 0;
+ id->hmmin = 0;
+
+ id->sqes = min_t(__u8, ((0x6 << 4) | 0x6), id->sqes);
+ id->cqes = min_t(__u8, ((0x4 << 4) | 0x4), id->cqes);
+ id->maxcmd = cpu_to_le16(NVMET_MAX_CMD);
+
+ /* don't support fuse commands */
+ id->fuses = 0;
+
+ id->sgls = cpu_to_le32(1 << 0); /* we always support SGLs */
+ if (ctrl->ops->has_keyed_sgls)
+ id->sgls |= cpu_to_le32(1 << 2);
+ if (req->port->inline_data_size)
+ id->sgls |= cpu_to_le32(1 << 20);
+
+ /*
+ * When passsthru controller is setup using nvme-loop transport it will
+ * export the passthru ctrl subsysnqn (PCIe NVMe ctrl) and will fail in
+ * the nvme/host/core.c in the nvme_init_subsystem()->nvme_active_ctrl()
+ * code path with duplicate ctr subsynqn. In order to prevent that we
+ * mask the passthru-ctrl subsysnqn with the target ctrl subsysnqn.
+ */
+ memcpy(id->subnqn, ctrl->subsysnqn, sizeof(id->subnqn));
+
+ /* use fabric id-ctrl values */
+ id->ioccsz = cpu_to_le32((sizeof(struct nvme_command) +
+ req->port->inline_data_size) / 16);
+ id->iorcsz = cpu_to_le32(sizeof(struct nvme_completion) / 16);
+
+ id->msdbd = ctrl->ops->msdbd;
+
+ status = nvmet_copy_to_sgl(req, 0, id, sizeof(struct nvme_id_ctrl));
+
+out_free:
+ kfree(id);
+out:
+ return status;
+}
+
+static u16 nvmet_passthru_override_id_ns(struct nvmet_req *req)
+{
+ u16 status = NVME_SC_SUCCESS;
+ struct nvme_id_ns *id;
+ int i;
+
+ id = kzalloc(sizeof(*id), GFP_KERNEL);
+ if (!id) {
+ status = NVME_SC_INTERNAL;
+ goto out;
+ }
+
+ status = nvmet_copy_from_sgl(req, 0, id, sizeof(struct nvme_id_ns));
+ if (status)
+ goto out_free;
+
+ for (i = 0; i < (id->nlbaf + 1); i++)
+ if (id->lbaf[i].ms)
+ memset(&id->lbaf[i], 0, sizeof(id->lbaf[i]));
+
+ id->flbas = id->flbas & ~(1 << 4);
+ id->mc = 0;
+
+ status = nvmet_copy_to_sgl(req, 0, id, sizeof(*id));
+
+out_free:
+ kfree(id);
+out:
+ return status;
+}
+
+static u16 nvmet_passthru_fixup_identify(struct nvmet_req *req)
+{
+ u16 status = NVME_SC_SUCCESS;
+
+ switch (req->cmd->identify.cns) {
+ case NVME_ID_CNS_CTRL:
+ status = nvmet_passthru_override_id_ctrl(req);
+ break;
+ case NVME_ID_CNS_NS:
+ status = nvmet_passthru_override_id_ns(req);
+ break;
+ }
+ return status;
+}
+
+static u16 nvmet_passthru_admin_passthru_start(struct nvmet_req *req)
+{
+ u16 status = NVME_SC_SUCCESS;
+
+ switch (req->cmd->common.opcode) {
+ case nvme_admin_format_nvm:
+ status = nvmet_passthru_override_format_nvm(req);
+ break;
+ }
+ return status;
+}
+
+static u16 nvmet_passthru_admin_passthru_end(struct nvmet_req *req)
+{
+ u8 aer_type = NVME_AER_TYPE_NOTICE;
+ u16 status = NVME_SC_SUCCESS;
+
+ switch (req->cmd->common.opcode) {
+ case nvme_admin_identify:
+ status = nvmet_passthru_fixup_identify(req);
+ break;
+ case nvme_admin_ns_mgmt:
+ case nvme_admin_ns_attach:
+ case nvme_admin_format_nvm:
+ if (!nvmet_add_async_event(req->sq->ctrl, aer_type, 0, 0))
+ status = NVME_SC_INTERNAL;
+ break;
+ }
+ return status;
+}
+
+static void nvmet_passthru_execute_admin_cmd(struct nvmet_req *req)
+{
+ u8 opcode = req->cmd->common.opcode;
+ u32 effects;
+ u16 status;
+
+ status = nvmet_passthru_admin_passthru_start(req);
+ if (status)
+ goto out;
+
+ effects = nvme_passthru_start(nvmet_req_passthru_ctrl(req), NULL,
+ opcode);
+
+ /*
+ * Admin Commands have side effects and it is better to handle those
+ * side effects in the submission thread context than in the request
+ * completion path, which is in the interrupt context. Also in this
+ * way, we keep the passhru admin command code path consistent with the
+ * nvme/host/core.c sync command submission APIs/IOCTLs and use
+ * nvme_passthru_start/end() to handle side effects consistently.
+ */
+ blk_execute_rq(req->p.rq->q, NULL, req->p.rq, 0);
+
+ nvme_passthru_end(nvmet_req_passthru_ctrl(req), effects);
+ status = nvmet_passthru_admin_passthru_end(req);
+out:
+ if (status == NVME_SC_SUCCESS) {
+ nvmet_set_result(req, nvme_req(req->p.rq)->result.u32);
+ status = nvme_req(req->p.rq)->status;
+ }
+
+ nvmet_passthru_req_complete(req, req->p.rq, status);
+}
+
+static int nvmet_passthru_map_sg(struct nvmet_req *req, struct request *rq)
+{
+ int sg_cnt = req->sg_cnt;
+ struct scatterlist *sg;
+ int op = REQ_OP_READ;
+ int op_flags = 0;
+ struct bio *bio;
+ int i, ret;
+
+ if (nvme_is_write(req->cmd)) {
+ op = REQ_OP_WRITE;
+ op_flags = REQ_SYNC | REQ_IDLE;
+ }
+
+ bio = bio_alloc(GFP_KERNEL, min(sg_cnt, BIO_MAX_PAGES));
+ bio->bi_end_io = bio_put;
+
+ for_each_sg(req->sg, sg, req->sg_cnt, i) {
+ if (bio_add_page(bio, sg_page(sg), sg->length,
+ sg->offset) != sg->length) {
+ ret = blk_rq_append_bio(rq, &bio);
+ if (unlikely(ret))
+ return ret;
+
+ bio_set_op_attrs(bio, op, op_flags);
+ bio = bio_alloc(GFP_KERNEL,
+ min(sg_cnt, BIO_MAX_PAGES));
+ }
+ sg_cnt--;
+ }
+
+ ret = blk_rq_append_bio(rq, &bio);
+ if (unlikely(ret))
+ return ret;
+
+ return 0;
+}
+
+static struct request *nvmet_passthru_blk_make_request(struct nvmet_req *req,
+ struct nvme_ns *ns, gfp_t gfp_mask)
+{
+ struct nvme_ctrl *passthru_ctrl = nvmet_req_passthru_ctrl(req);
+ struct nvme_command *cmd = req->cmd;
+ struct request_queue *q;
+ struct request *rq;
+ int ret;
+
+ if (ns)
+ q = ns->queue;
+ else
+ q = passthru_ctrl->admin_q;
+
+ rq = nvme_alloc_request(q, cmd, BLK_MQ_REQ_NOWAIT, NVME_QID_ANY);
+ if (unlikely(IS_ERR(rq)))
+ return rq;
+
+ if (req->sg_cnt) {
+ ret = nvmet_passthru_map_sg(req, rq);
+ if (unlikely(ret)) {
+ blk_put_request(rq);
+ return ERR_PTR(ret);
+ }
+ }
+
+ /*
+ * We don't support fused cmds, also nvme-pci driver uses its own
+ * sgl_threshold parameter to decide whether to use SGLs or PRPs hence
+ * turn off those bits in the flags.
+ */
+ req->cmd->common.flags &= ~(NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND |
+ NVME_CMD_SGL_ALL);
+ return rq;
+}
+
+
+static void nvmet_passthru_execute_admin_work(struct work_struct *w)
+{
+ struct nvmet_req *req = container_of(w, struct nvmet_req, p.work);
+
+ nvmet_passthru_execute_admin_cmd(req);
+}
+
+static void nvmet_passthru_submit_admin_cmd(struct nvmet_req *req)
+{
+ INIT_WORK(&req->p.work, nvmet_passthru_execute_admin_work);
+ queue_work(passthru_wq, &req->p.work);
+}
+
+static void nvmet_passthru_execute_cmd(struct nvmet_req *req)
+{
+ struct request *rq = NULL;
+ struct nvme_ns *ns = NULL;
+ u16 status;
+
+ if (likely(req->sq->qid != 0)) {
+ u32 nsid = le32_to_cpu(req->cmd->common.nsid);
+
+ ns = nvme_find_get_ns(nvmet_req_passthru_ctrl(req), nsid);
+ if (unlikely(!ns)) {
+ pr_err("failed to get passthru ns nsid:%u\n", nsid);
+ status = NVME_SC_INVALID_NS | NVME_SC_DNR;
+ goto fail_out;
+ }
+ }
+
+ rq = nvmet_passthru_blk_make_request(req, ns, GFP_KERNEL);
+ if (unlikely(IS_ERR(rq))) {
+ rq = NULL;
+ status = NVME_SC_INTERNAL;
+ goto fail_out;
+ }
+
+ if (unlikely(blk_rq_nr_phys_segments(rq) > queue_max_segments(rq->q) ||
+ (blk_rq_payload_bytes(rq) >> 9) > queue_max_hw_sectors(rq->q))) {
+ status = NVME_SC_INVALID_FIELD;
+ goto fail_out;
+ }
+
+ rq->end_io_data = req;
+ if (req->sq->qid != 0) {
+ blk_execute_rq_nowait(rq->q, NULL, rq, 0,
+ nvmet_passthru_req_done);
+ } else {
+ req->p.rq = rq;
+ nvmet_passthru_submit_admin_cmd(req);
+ }
+
+ if (ns)
+ nvme_put_ns(ns);
+
+ return;
+
+fail_out:
+ if (ns)
+ nvme_put_ns(ns);
+ nvmet_passthru_req_complete(req, rq, status);
+}
+
+/*
+ * We emulate commands which are not routed through the existing target
+ * code and not supported by the passthru ctrl. E.g consider a scenario where
+ * passthru ctrl version is < 1.3.0. Target Fabrics ctrl version is >= 1.3.0
+ * in that case in order to be fabrics compliant we need to emulate ns-desc-list
+ * command which is 1.3.0 compliant but not present for the passthru ctrl due
+ * to lower version.
+ */
+static void nvmet_passthru_emulate_id_desclist(struct nvmet_req *req)
+{
+ int nsid = le32_to_cpu(req->cmd->common.nsid);
+ u16 status = NVME_SC_SUCCESS;
+ struct nvme_ns_ids *ids;
+ struct nvme_ns *ns;
+ off_t off = 0;
+
+ ns = nvme_find_get_ns(nvmet_req_passthru_ctrl(req), nsid);
+ if (unlikely(!ns)) {
+ pr_err("failed to get passthru ns nsid:%u\n", nsid);
+ status = NVME_SC_INVALID_NS | NVME_SC_DNR;
+ goto out;
+ }
+ /*
+ * Instead of refactoring and creating helpers, keep it simple and
+ * just re-use the code from admin-cmd.c ->
+ * nvmet_execute_identify_ns_desclist().
+ */
+ ids = &ns->head->ids;
+ if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64))) {
+ status = nvmet_copy_ns_identifier(req, NVME_NIDT_EUI64,
+ NVME_NIDT_EUI64_LEN,
+ &ids->eui64, &off);
+ if (status)
+ goto out_put_ns;
+ }
+ if (memchr_inv(&ids->uuid, 0, sizeof(ids->uuid))) {
+ status = nvmet_copy_ns_identifier(req, NVME_NIDT_UUID,
+ NVME_NIDT_UUID_LEN,
+ &ids->uuid, &off);
+ if (status)
+ goto out_put_ns;
+ }
+ if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid))) {
+ status = nvmet_copy_ns_identifier(req, NVME_NIDT_NGUID,
+ NVME_NIDT_NGUID_LEN,
+ &ids->nguid, &off);
+ if (status)
+ goto out_put_ns;
+ }
+
+ if (sg_zero_buffer(req->sg, req->sg_cnt, NVME_IDENTIFY_DATA_SIZE - off,
+ off) != NVME_IDENTIFY_DATA_SIZE - off)
+ status = NVME_SC_INTERNAL | NVME_SC_DNR;
+out_put_ns:
+ nvme_put_ns(ns);
+out:
+ nvmet_req_complete(req, status);
+}
+
+/*
+ * In the passthru mode we support three types for commands:-
+ * 1. Commands which are black-listed.
+ * 2. Commands which are routed through target code.
+ * 3. Commands which are emulated in the target code, since we can't rely
+ * on passthru-ctrl and cannot route through the target code.
+ */
+static u16 nvmet_parse_passthru_admin_cmd(struct nvmet_req *req)
+{
+ struct nvme_command *cmd = req->cmd;
+ u16 status = 0;
+
+ switch (cmd->common.opcode) {
+ /* 1. commands which are blacklisted */
+ case nvme_admin_create_sq:
+ case nvme_admin_create_cq:
+ case nvme_admin_delete_sq:
+ case nvme_admin_delete_cq:
+ case nvme_admin_activate_fw:
+ case nvme_admin_download_fw:
+ case nvme_admin_ns_attach:
+ case nvme_admin_directive_send:
+ case nvme_admin_directive_recv:
+ case nvme_admin_dbbuf:
+ case nvme_admin_security_send:
+ case nvme_admin_security_recv:
+ status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR;
+ req->execute = NULL;
+ break;
+ /* 2. commands which are routed through target code */
+ case nvme_admin_async_event:
+ /*
+ * Right now we don't monitor any events for the passthru controller.
+ * Instead generate asyn event notice for the ns-mgmt/format/attach
+ * commands so that host can update it's ns-inventory.
+ */
+ /* fallthru */
+ case nvme_admin_keep_alive:
+ /*
+ * Most PCIe ctrls don't support keep alive cmd, we route keep alive
+ * to the non-passthru mode. In future please change this code when
+ * PCIe ctrls with keep alive support available.
+ */
+ status = nvmet_parse_admin_cmd(req);
+ break;
+ case nvme_admin_set_features:
+ switch (le32_to_cpu(req->cmd->features.fid)) {
+ case NVME_FEAT_ASYNC_EVENT:
+ case NVME_FEAT_KATO:
+ case NVME_FEAT_NUM_QUEUES:
+ status = nvmet_parse_admin_cmd(req);
+ break;
+ default:
+ req->execute = nvmet_passthru_execute_cmd;
+ }
+ break;
+ /* 3. commands which are emulated in the passthru code */
+ case nvme_admin_identify:
+ switch (req->cmd->identify.cns) {
+ case NVME_ID_CNS_NS_DESC_LIST:
+ req->execute = nvmet_passthru_emulate_id_desclist;
+ break;
+ default:
+ req->execute = nvmet_passthru_execute_cmd;
+ }
+ break;
+ default:
+ /*
+ * We passthru all the remaining commands, including
+ * Vendor-Unique Commands
+ */
+ req->execute = nvmet_passthru_execute_cmd;
+ }
+
+ return status;
+}
+
+u16 nvmet_parse_passthru_cmd(struct nvmet_req *req)
+{
+ int ret;
+
+ if (unlikely(req->cmd->common.opcode == nvme_fabrics_command))
+ return nvmet_parse_fabrics_cmd(req);
+ else if (unlikely(req->sq->ctrl->subsys->type == NVME_NQN_DISC))
+ return nvmet_parse_discovery_cmd(req);
+
+ ret = nvmet_check_ctrl_status(req, req->cmd);
+ if (unlikely(ret))
+ return ret;
+
+ if (unlikely(req->sq->qid == 0))
+ return nvmet_parse_passthru_admin_cmd(req);
+
+ req->execute = nvmet_passthru_execute_cmd;
+ return NVME_SC_SUCCESS;
+}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index d1a0a3190a24..bd11114ebbb9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -227,6 +227,10 @@ struct nvmet_subsys {

struct config_group namespaces_group;
struct config_group allowed_hosts_group;
+
+#ifdef CONFIG_NVME_TARGET_PASSTHRU
+ struct nvme_ctrl *passthru_ctrl;
+#endif /* CONFIG_NVME_TARGET_PASSTHRU */
};

static inline struct nvmet_subsys *to_subsys(struct config_item *item)
@@ -302,6 +306,10 @@ struct nvmet_req {
struct bio_vec *bvec;
struct work_struct work;
} f;
+ struct {
+ struct request *rq;
+ struct work_struct work;
+ } p;
};
int sg_cnt;
/* data length as parsed from the command: */
@@ -497,6 +505,44 @@ static inline u32 nvmet_rw_len(struct nvmet_req *req)
req->ns->blksize_shift;
}

+#ifdef CONFIG_NVME_TARGET_PASSTHRU
+
+int nvmet_passthru_init(void);
+void nvmet_passthru_destroy(void);
+u16 nvmet_parse_passthru_cmd(struct nvmet_req *req);
+
+static inline
+struct nvme_ctrl *nvmet_passthru_ctrl(struct nvmet_subsys *subsys)
+{
+ return subsys->passthru_ctrl;
+}
+
+#else /* CONFIG_NVME_TARGET_PASSTHRU */
+
+static inline int nvmet_passthru_init(void)
+{
+ return 0;
+}
+static inline void nvmet_passthru_destroy(void)
+{
+}
+static inline u16 nvmet_parse_passthru_cmd(struct nvmet_req *req)
+{
+ return 0;
+}
+static inline
+struct nvme_ctrl *nvmet_passthru_ctrl(struct nvmet_subsys *subsys)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_NVME_TARGET_PASSTHRU */
+
+static inline struct nvme_ctrl *nvmet_req_passthru_ctrl(struct nvmet_req *req)
+{
+ return nvmet_passthru_ctrl(req->sq->ctrl->subsys);
+}
+
u16 errno_to_nvme_status(struct nvmet_req *req, int errno);

/* Convert a 32-bit number to a 16-bit 0's based number */
--
2.20.1


2019-07-25 18:13:15

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 05/16] nvme-core: export existing ctrl and ns interfaces

From: Chaitanya Kulkarni <[email protected]>

We export existing nvme ctrl and ns management APIs so that target
passthru code can manage the nvme ctrl.

This is a preparation patch for implementing NVMe Over Fabric target
passthru feature.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 17 +++++++++++------
drivers/nvme/host/nvme.h | 7 +++++++
2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 61e7b016ec36..f9642e26aafc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -412,10 +412,11 @@ static void nvme_free_ns(struct kref *kref)
kfree(ns);
}

-static void nvme_put_ns(struct nvme_ns *ns)
+void nvme_put_ns(struct nvme_ns *ns)
{
kref_put(&ns->kref, nvme_free_ns);
}
+EXPORT_SYMBOL_GPL(nvme_put_ns);

static inline void nvme_clear_nvme_request(struct request *req)
{
@@ -1088,7 +1089,7 @@ static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *n
NVME_IDENTIFY_DATA_SIZE);
}

-static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
+struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
unsigned nsid)
{
struct nvme_id_ns *id;
@@ -1113,6 +1114,7 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,

return id;
}
+EXPORT_SYMBOL_GPL(nvme_identify_ns);

static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
unsigned int dword11, void *buffer, size_t buflen, u32 *result)
@@ -1261,8 +1263,8 @@ static u32 nvme_known_admin_effects(u8 opcode)
return 0;
}

-static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
- u8 opcode)
+u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+ u8 opcode)
{
u32 effects = 0;

@@ -1291,6 +1293,7 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
}
return effects;
}
+EXPORT_SYMBOL_GPL(nvme_passthru_start);

static void nvme_update_formats(struct nvme_ctrl *ctrl)
{
@@ -1305,7 +1308,7 @@ static void nvme_update_formats(struct nvme_ctrl *ctrl)
nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
}

-static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
+void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
{
/*
* Revalidate LBA changes prior to unfreezing. This is necessary to
@@ -1323,6 +1326,7 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC))
nvme_queue_scan(ctrl);
}
+EXPORT_SYMBOL_GPL(nvme_passthru_end);

static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct nvme_passthru_cmd __user *ucmd)
@@ -3263,7 +3267,7 @@ static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
return nsa->head->ns_id - nsb->head->ns_id;
}

-static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns, *ret = NULL;

@@ -3281,6 +3285,7 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
up_read(&ctrl->namespaces_rwsem);
return ret;
}
+EXPORT_SYMBOL_GPL(nvme_find_get_ns);

static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a19bd0091de6..ac0a8b896861 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -436,6 +436,8 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl);
void nvme_stop_ctrl(struct nvme_ctrl *ctrl);
void nvme_put_ctrl(struct nvme_ctrl *ctrl);
int nvme_init_identify(struct nvme_ctrl *ctrl);
+struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned int nsid);
+void nvme_put_ns(struct nvme_ns *ns);

void nvme_remove_namespaces(struct nvme_ctrl *ctrl);

@@ -472,8 +474,13 @@ int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
unsigned int dword11, void *buffer, size_t buflen,
u32 *result);
+u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+ u8 opcode);
+void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects);
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
+struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
+ unsigned int nsid);
int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
--
2.20.1


2019-07-25 18:13:34

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 15/16] block: call blk_account_io_start() in blk_execute_rq_nowait()

All existing users of blk_execute_rq[_nowait]() are for passthrough
commands and will thus be rejected by blk_do_io_stat().

This allows passthrough requests to opt-in to IO accounting by setting
RQF_IO_STAT.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-exec.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 1db44ca0f4a6..e20a852ae432 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -55,6 +55,8 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
rq->rq_disk = bd_disk;
rq->end_io = done;

+ blk_account_io_start(rq, true);
+
/*
* don't check dying flag for MQ because the request won't
* be reused after dying flag is set
--
2.20.1


2019-07-25 18:13:44

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

cdev_get_by_path() attempts to retrieve a struct cdev from
a path name. It is analagous to blkdev_get_by_path().

This will be necessary to create a nvme_ctrl_get_by_path()to
support NVMe-OF passthru.

Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/char_dev.c | 34 ++++++++++++++++++++++++++++++++++
include/linux/cdev.h | 1 +
2 files changed, 35 insertions(+)

diff --git a/fs/char_dev.c b/fs/char_dev.c
index 5cc53bd5f654..25037d642ff8 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -22,6 +22,7 @@
#include <linux/mutex.h>
#include <linux/backing-dev.h>
#include <linux/tty.h>
+#include <linux/namei.h>

#include "internal.h"

@@ -403,6 +404,38 @@ static struct cdev *cdev_lookup(struct inode *inode)
return p;
}

+struct cdev *cdev_get_by_path(const char *pathname)
+{
+ struct inode *inode;
+ struct cdev *cdev;
+ struct path path;
+ int error;
+
+ if (!pathname || !*pathname)
+ return ERR_PTR(-EINVAL);
+
+ error = kern_path(pathname, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);
+
+ if (!may_open_dev(&path)) {
+ cdev = ERR_PTR(-EACCES);
+ goto out;
+ }
+
+ inode = d_backing_inode(path.dentry);
+ if (!S_ISCHR(inode->i_mode)) {
+ cdev = ERR_PTR(-EINVAL);
+ goto out;
+ }
+
+ cdev = cdev_lookup(inode);
+
+out:
+ path_put(&path);
+ return cdev;
+}
+
/*
* Called every time a character special file is opened
*/
@@ -690,5 +723,6 @@ EXPORT_SYMBOL(cdev_add);
EXPORT_SYMBOL(cdev_set_parent);
EXPORT_SYMBOL(cdev_device_add);
EXPORT_SYMBOL(cdev_device_del);
+EXPORT_SYMBOL(cdev_get_by_path);
EXPORT_SYMBOL(__register_chrdev);
EXPORT_SYMBOL(__unregister_chrdev);
diff --git a/include/linux/cdev.h b/include/linux/cdev.h
index 0e8cd6293deb..c7f2df2ca69a 100644
--- a/include/linux/cdev.h
+++ b/include/linux/cdev.h
@@ -24,6 +24,7 @@ void cdev_init(struct cdev *, const struct file_operations *);

struct cdev *cdev_alloc(void);

+struct cdev *cdev_get_by_path(const char *pathname);
void cdev_put(struct cdev *p);

int cdev_add(struct cdev *, dev_t, unsigned);
--
2.20.1


2019-07-25 18:14:00

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v6 01/16] chardev: factor out cdev_lookup() helper

Create a new helper out of the code in chrdev_open() which looks up
a struct cdev from a struct inode.

This will be necessary to create a cdev_get_by_path() which is
similar to blkdev_get_by_path() and is required to allow NVMe-OF
passthru to lookup an NVMe controller cdev from a path.

Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/char_dev.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/char_dev.c b/fs/char_dev.c
index 00dfe17871ac..5cc53bd5f654 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -367,12 +367,8 @@ void cdev_put(struct cdev *p)
}
}

-/*
- * Called every time a character special file is opened
- */
-static int chrdev_open(struct inode *inode, struct file *filp)
+static struct cdev *cdev_lookup(struct inode *inode)
{
- const struct file_operations *fops;
struct cdev *p;
struct cdev *new = NULL;
int ret = 0;
@@ -385,7 +381,7 @@ static int chrdev_open(struct inode *inode, struct file *filp)
spin_unlock(&cdev_lock);
kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
if (!kobj)
- return -ENXIO;
+ return ERR_PTR(-ENXIO);
new = container_of(kobj, struct cdev, kobj);
spin_lock(&cdev_lock);
/* Check i_cdev again in case somebody beat us to it while
@@ -402,7 +398,23 @@ static int chrdev_open(struct inode *inode, struct file *filp)
spin_unlock(&cdev_lock);
cdev_put(new);
if (ret)
- return ret;
+ return ERR_PTR(ret);
+
+ return p;
+}
+
+/*
+ * Called every time a character special file is opened
+ */
+static int chrdev_open(struct inode *inode, struct file *filp)
+{
+ const struct file_operations *fops;
+ struct cdev *p;
+ int ret = 0;
+
+ p = cdev_lookup(inode);
+ if (IS_ERR(p))
+ return PTR_ERR(p);

ret = -ENXIO;
fops = fops_get(p->ops);
--
2.20.1


2019-07-25 18:14:54

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()

On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
>
> The purpose of this function is to support NVMe-OF target passthru.

I can't find anywhere that you use this in this patchset.

2019-07-25 18:14:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
> cdev_get_by_path() attempts to retrieve a struct cdev from
> a path name. It is analagous to blkdev_get_by_path().
>
> This will be necessary to create a nvme_ctrl_get_by_path()to
> support NVMe-OF passthru.

Ick, why? Why would a cdev have a "pathname"?

What is "NVMe-OF passthru"? Why does a char device node have anything
to do with NVMe?

We have way too many ways to abuse cdevs today, my long-term-wish has
always been to clean this interface up to make it more sane and unified,
and get rid of the "outliers" (all created at the time for a good
reason, that's not the problem.) But to add "just one more" seems
really odd to me.

thanks,

greg k-h

2019-07-25 18:15:11

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()



On 2019-07-25 11:50 a.m., Matthew Wilcox wrote:
> On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
>> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
>> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
>>
>> The purpose of this function is to support NVMe-OF target passthru.
>
> I can't find anywhere that you use this in this patchset.
>

Oh sorry, the commit message is out of date the function was actually
called nvme_ctrl_get_by_path() and it's used in Patch 10.

I'll fix the commit message for the next version.

Logan

2019-07-25 18:15:29

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
> On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
>> cdev_get_by_path() attempts to retrieve a struct cdev from
>> a path name. It is analagous to blkdev_get_by_path().
>>
>> This will be necessary to create a nvme_ctrl_get_by_path()to
>> support NVMe-OF passthru.
>
> Ick, why? Why would a cdev have a "pathname"?

So we can go from "/dev/nvme0" (which points to a char device) to its
struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
allows supporting symlinks that might be created by udev rules.

This is very similar to blkdev_get_by_path() that lets regular NVMe-OF
obtain the struct block_device from a path.

I didn't think this would be all that controversial.

> What is "NVMe-OF passthru"? Why does a char device node have anything
> to do with NVMe?

NVME-OF passthru is support for NVME over fabrics to directly target a
regular NVMe controller and thus export an entire NVMe device to a
remote system. We need to be able to tell the kernel which controller to
use and IMO a path to the device file is the best way as it allows us to
support symlinks created by udev.

> We have way too many ways to abuse cdevs today, my long-term-wish has
> always been to clean this interface up to make it more sane and unified,
> and get rid of the "outliers" (all created at the time for a good
> reason, that's not the problem.) But to add "just one more" seems
> really odd to me.

Well it doesn't seem all that much like an outlier to me.

Logan

2019-07-25 18:15:53

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 11:58 a.m., Matthew Wilcox wrote:
> On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
>>> On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
>>>> cdev_get_by_path() attempts to retrieve a struct cdev from
>>>> a path name. It is analagous to blkdev_get_by_path().
>>>>
>>>> This will be necessary to create a nvme_ctrl_get_by_path()to
>>>> support NVMe-OF passthru.
>>>
>>> Ick, why? Why would a cdev have a "pathname"?
>>
>> So we can go from "/dev/nvme0" (which points to a char device) to its
>> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
>> allows supporting symlinks that might be created by udev rules.
>
> But you're not really trying to go from a string to a chardev. You're
> trying to go from a nvmet_subsys to a chardev. Isn't there a better
> way to link the two somewhere else?
>
> (I must confess that once I would have known the answer to this, but
> the NVMe subsystem has grown ridiculously complex and I can no longer
> fit it in my head)

Well the nvmet_subsys isn't related to the nvme_ctrl (and thus char dev)
at all. An nvmet_subsys is created via configfs and the user has to
specify an NVMe controller for it to use (by writting a string to a
config attribute). The best handle the user has is a path to the
controller's cdev (/dev/nvmeX) so the fabrics code has to be able to
lookup the corresponding struct nvme_ctrl from the path.

This is directly analogous to the way NVMe-of works today: it uses
blkdev_get_by_path() to translate a user provided path to a struct
block_device. The only difference here is that, for passthru, we need a
nvme_ctrl, not a block device.

Logan

2019-07-25 18:16:00

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 12:08 p.m., Greg Kroah-Hartman wrote:
> On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
>>> On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
>>>> cdev_get_by_path() attempts to retrieve a struct cdev from
>>>> a path name. It is analagous to blkdev_get_by_path().
>>>>
>>>> This will be necessary to create a nvme_ctrl_get_by_path()to
>>>> support NVMe-OF passthru.
>>>
>>> Ick, why? Why would a cdev have a "pathname"?
>>
>> So we can go from "/dev/nvme0" (which points to a char device) to its
>> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
>> allows supporting symlinks that might be created by udev rules.
>
> Why do you have a "string" within the kernel and are not using the
> normal open() call from userspace on the character device node on the
> filesystem in your namespace/mount/whatever?

NVMe-OF is configured using configfs. The target is specified by the
user writing a path to a configfs attribute. This is the way it works
today but with blkdev_get_by_path()[1]. For the passthru code, we need
to get a nvme_ctrl instead of a block_device, but the principal is the same.

> Where is this random string coming from?

configfs

> Why is this so special that no
> one else has ever needed it?

People have needed the same functionality for block devices and
blkdev_get_by_path() has multiple users (iscsi, drbd, nvme-of, etc)
which are doing similar things. Nobody has needed to do the same with a
chardev until we wanted the NVMe-of to support targeting an NVMe
controller which is represented in userspace by a char device.

Logan


[1]
https://elixir.bootlin.com/linux/latest/source/drivers/nvme/target/io-cmd-bdev.c#L15

2019-07-25 18:16:24

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
> > On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
> >> cdev_get_by_path() attempts to retrieve a struct cdev from
> >> a path name. It is analagous to blkdev_get_by_path().
> >>
> >> This will be necessary to create a nvme_ctrl_get_by_path()to
> >> support NVMe-OF passthru.
> >
> > Ick, why? Why would a cdev have a "pathname"?
>
> So we can go from "/dev/nvme0" (which points to a char device) to its
> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
> allows supporting symlinks that might be created by udev rules.
>
> This is very similar to blkdev_get_by_path() that lets regular NVMe-OF
> obtain the struct block_device from a path.
>
> I didn't think this would be all that controversial.
>
> > What is "NVMe-OF passthru"? Why does a char device node have anything
> > to do with NVMe?
>
> NVME-OF passthru is support for NVME over fabrics to directly target a
> regular NVMe controller and thus export an entire NVMe device to a
> remote system. We need to be able to tell the kernel which controller to
> use and IMO a path to the device file is the best way as it allows us to
> support symlinks created by udev.

open() in userspace handles symlinks just fine, what crazy interface
passes a string to try to find a char device node that is not open()?

And why do you need a char device at all anyway? Is this just the
"normal" nvme controller's character device node?

> > We have way too many ways to abuse cdevs today, my long-term-wish has
> > always been to clean this interface up to make it more sane and unified,
> > and get rid of the "outliers" (all created at the time for a good
> > reason, that's not the problem.) But to add "just one more" seems
> > really odd to me.
>
> Well it doesn't seem all that much like an outlier to me.

Everyone is special, just like everyone else :)

Seriously, as no one else has ever needed this, you are an outlier.

thanks,

greg k-h

2019-07-25 18:16:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
> > On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
> >> cdev_get_by_path() attempts to retrieve a struct cdev from
> >> a path name. It is analagous to blkdev_get_by_path().
> >>
> >> This will be necessary to create a nvme_ctrl_get_by_path()to
> >> support NVMe-OF passthru.
> >
> > Ick, why? Why would a cdev have a "pathname"?
>
> So we can go from "/dev/nvme0" (which points to a char device) to its
> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
> allows supporting symlinks that might be created by udev rules.

But you're not really trying to go from a string to a chardev. You're
trying to go from a nvmet_subsys to a chardev. Isn't there a better
way to link the two somewhere else?

(I must confess that once I would have known the answer to this, but
the NVMe subsystem has grown ridiculously complex and I can no longer
fit it in my head)

2019-07-25 18:17:01

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
> > On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
> >> cdev_get_by_path() attempts to retrieve a struct cdev from
> >> a path name. It is analagous to blkdev_get_by_path().
> >>
> >> This will be necessary to create a nvme_ctrl_get_by_path()to
> >> support NVMe-OF passthru.
> >
> > Ick, why? Why would a cdev have a "pathname"?
>
> So we can go from "/dev/nvme0" (which points to a char device) to its
> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
> allows supporting symlinks that might be created by udev rules.

Why do you have a "string" within the kernel and are not using the
normal open() call from userspace on the character device node on the
filesystem in your namespace/mount/whatever?

Where is this random string coming from? Why is this so special that no
one else has ever needed it?

thanks,

greg k-h

2019-07-25 18:18:20

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 12:10 p.m., Greg Kroah-Hartman wrote:
> On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
>>> On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
>>>> cdev_get_by_path() attempts to retrieve a struct cdev from
>>>> a path name. It is analagous to blkdev_get_by_path().
>>>>
>>>> This will be necessary to create a nvme_ctrl_get_by_path()to
>>>> support NVMe-OF passthru.
>>>
>>> Ick, why? Why would a cdev have a "pathname"?
>>
>> So we can go from "/dev/nvme0" (which points to a char device) to its
>> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
>> allows supporting symlinks that might be created by udev rules.
>>
>> This is very similar to blkdev_get_by_path() that lets regular NVMe-OF
>> obtain the struct block_device from a path.
>>
>> I didn't think this would be all that controversial.
>>
>>> What is "NVMe-OF passthru"? Why does a char device node have anything
>>> to do with NVMe?
>>
>> NVME-OF passthru is support for NVME over fabrics to directly target a
>> regular NVMe controller and thus export an entire NVMe device to a
>> remote system. We need to be able to tell the kernel which controller to
>> use and IMO a path to the device file is the best way as it allows us to
>> support symlinks created by udev.
>
> open() in userspace handles symlinks just fine, what crazy interface
> passes a string to try to find a char device node that is not open()?

configfs. Which I'm stuck with seeing nvme-of already uses that for
configuration and I don't think that's going to change...

> And why do you need a char device at all anyway? Is this just the
> "normal" nvme controller's character device node?

Yes.

Logan

2019-07-25 18:29:47

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 12:14:33PM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 12:08 p.m., Greg Kroah-Hartman wrote:
> > On Thu, Jul 25, 2019 at 11:53:20AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-07-25 11:40 a.m., Greg Kroah-Hartman wrote:
> >>> On Thu, Jul 25, 2019 at 11:23:21AM -0600, Logan Gunthorpe wrote:
> >>>> cdev_get_by_path() attempts to retrieve a struct cdev from
> >>>> a path name. It is analagous to blkdev_get_by_path().
> >>>>
> >>>> This will be necessary to create a nvme_ctrl_get_by_path()to
> >>>> support NVMe-OF passthru.
> >>>
> >>> Ick, why? Why would a cdev have a "pathname"?
> >>
> >> So we can go from "/dev/nvme0" (which points to a char device) to its
> >> struct cdev and eventually it's struct nvme_ctrl. Doing it this way also
> >> allows supporting symlinks that might be created by udev rules.
> >
> > Why do you have a "string" within the kernel and are not using the
> > normal open() call from userspace on the character device node on the
> > filesystem in your namespace/mount/whatever?
>
> NVMe-OF is configured using configfs. The target is specified by the
> user writing a path to a configfs attribute. This is the way it works
> today but with blkdev_get_by_path()[1]. For the passthru code, we need
> to get a nvme_ctrl instead of a block_device, but the principal is the same.

Why isn't a fd being passed in there instead of a random string?

Seems odd, but oh well, that ship sailed a long time ago for block
devices I guess.

So what do you actually _do_ with that char device once you have it?

greg k-h

2019-07-25 18:38:10

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 12:27 p.m., Greg Kroah-Hartman wrote:
>>> Why do you have a "string" within the kernel and are not using the
>>> normal open() call from userspace on the character device node on the
>>> filesystem in your namespace/mount/whatever?
>>
>> NVMe-OF is configured using configfs. The target is specified by the
>> user writing a path to a configfs attribute. This is the way it works
>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>
> Why isn't a fd being passed in there instead of a random string?

I wouldn't know the answer to this but I assume because once we decided
to use configfs, there was no way for the user to pass the kernel an fd.

> Seems odd, but oh well, that ship sailed a long time ago for block
> devices I guess.

Yup.

> So what do you actually _do_ with that char device once you have it?

We lookup the struct nvme_ctrl and use it to submit passed-through NVMe
commands directly to the controller.

Logan

2019-07-25 19:02:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 08:27:01PM +0200, Greg Kroah-Hartman wrote:
> > NVMe-OF is configured using configfs. The target is specified by the
> > user writing a path to a configfs attribute. This is the way it works
> > today but with blkdev_get_by_path()[1]. For the passthru code, we need
> > to get a nvme_ctrl instead of a block_device, but the principal is the same.
>
> Why isn't a fd being passed in there instead of a random string?

I suppose we could echo a string of the file descriptor number there,
and look up the fd in the process' file descriptor table ...

I'll get my coat.


2019-07-25 19:03:58

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>>> Why do you have a "string" within the kernel and are not using the
>>>> normal open() call from userspace on the character device node on the
>>>> filesystem in your namespace/mount/whatever?
>>>
>>> NVMe-OF is configured using configfs. The target is specified by the
>>> user writing a path to a configfs attribute. This is the way it works
>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>
>> Why isn't a fd being passed in there instead of a random string?
>
> I wouldn't know the answer to this but I assume because once we decided
> to use configfs, there was no way for the user to pass the kernel an fd.

That's definitely not changing. But this is not different than how we
use the block device or file configuration, this just happen to need the
nvme controller chardev now to issue I/O.

2019-07-25 19:06:49

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>> NVMe-OF is configured using configfs. The target is specified by the
>>> user writing a path to a configfs attribute. This is the way it works
>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>
>> Why isn't a fd being passed in there instead of a random string?
>
> I suppose we could echo a string of the file descriptor number there,
> and look up the fd in the process' file descriptor table ...

Assuming that there is a open handle somewhere out there...

> I'll get my coat.

Grab mine too..

2019-07-25 19:13:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 12:05:29PM -0700, Sagi Grimberg wrote:
>
> > > > NVMe-OF is configured using configfs. The target is specified by the
> > > > user writing a path to a configfs attribute. This is the way it works
> > > > today but with blkdev_get_by_path()[1]. For the passthru code, we need
> > > > to get a nvme_ctrl instead of a block_device, but the principal is the same.
> > >
> > > Why isn't a fd being passed in there instead of a random string?
> >
> > I suppose we could echo a string of the file descriptor number there,
> > and look up the fd in the process' file descriptor table ...
>
> Assuming that there is a open handle somewhere out there...

Well, that's how we'd know that the application echoing /dev/nvme3 into
configfs actually has permission to access /dev/nvme3. Think about
containers, for example. It's not exactly safe to mount configfs in a
non-root container since it can access any NVMe device in the system,
not just ones which it's been given permission to access. Right?

2019-07-25 19:26:40

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 1:11 p.m., Matthew Wilcox wrote:
> On Thu, Jul 25, 2019 at 12:05:29PM -0700, Sagi Grimberg wrote:
>>
>>>>> NVMe-OF is configured using configfs. The target is specified by the
>>>>> user writing a path to a configfs attribute. This is the way it works
>>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>>>
>>>> Why isn't a fd being passed in there instead of a random string?
>>>
>>> I suppose we could echo a string of the file descriptor number there,
>>> and look up the fd in the process' file descriptor table ...
>>
>> Assuming that there is a open handle somewhere out there...

Yes, that would be a step backwards from an interface. The user would
then need a special process to open the fd and pass it through configfs.
They couldn't just do it with basic bash commands.

> Well, that's how we'd know that the application echoing /dev/nvme3 into
> configfs actually has permission to access /dev/nvme3.

It's the kernel that's accessing the device so it has permission. root
permission is required to configure the kernel.

> Think about
> containers, for example. It's not exactly safe to mount configfs in a
> non-root container since it can access any NVMe device in the system,
> not just ones which it's been given permission to access. Right?

I don't think it really makes any sense to talk about NVMe-of and
containers. Though, if we did it would be solely on the configuration
interface so that users inside a container might be able to configure a
new target for resources they can see and they'd have to have their own
view into configfs....

Logan


2019-07-25 19:28:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 01:24:22PM -0600, Logan Gunthorpe wrote:
> >> Assuming that there is a open handle somewhere out there...
>
> Yes, that would be a step backwards from an interface. The user would
> then need a special process to open the fd and pass it through configfs.
> They couldn't just do it with basic bash commands.

echo 3 3</dev/nvme3 >/configfs/foor/bar/whatever


2019-07-25 19:32:17

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>>>> NVMe-OF is configured using configfs. The target is specified by the
>>>>> user writing a path to a configfs attribute. This is the way it works
>>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>>>
>>>> Why isn't a fd being passed in there instead of a random string?
>>>
>>> I suppose we could echo a string of the file descriptor number there,
>>> and look up the fd in the process' file descriptor table ...
>>
>> Assuming that there is a open handle somewhere out there...
>
> Well, that's how we'd know that the application echoing /dev/nvme3 into
> configfs actually has permission to access /dev/nvme3.

Actually, the application is exposing a target device to someone else,
its actually preferable that it doesn't have access to it as its
possibly can create a consistency hole, but that is usually a root
application anyways... We could verify at least that though..

> Think about
> containers, for example. It's not exactly safe to mount configfs in a
> non-root container since it can access any NVMe device in the system,
> not just ones which it's been given permission to access. Right?

I'm seeing this as an equivalent to an application that is binding a
socket to an ip address, and the kernel looks-up according to the net
namespace that the socket has.

I do agree this is an area that was never really thought of. But what
you are describing requires infrastructure around it instead of forcing
the user to pass an fd to validate around it.

2019-07-25 19:32:35

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 1:26 p.m., Matthew Wilcox wrote:
> On Thu, Jul 25, 2019 at 01:24:22PM -0600, Logan Gunthorpe wrote:
>>>> Assuming that there is a open handle somewhere out there...
>>
>> Yes, that would be a step backwards from an interface. The user would
>> then need a special process to open the fd and pass it through configfs.
>> They couldn't just do it with basic bash commands.
>
> echo 3 3</dev/nvme3 >/configfs/foor/bar/whatever

Neat. I didn't know you could do that.

Logan

2019-07-25 19:35:09

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 12:02:30PM -0700, Sagi Grimberg wrote:
>
> > > > > Why do you have a "string" within the kernel and are not using the
> > > > > normal open() call from userspace on the character device node on the
> > > > > filesystem in your namespace/mount/whatever?
> > > >
> > > > NVMe-OF is configured using configfs. The target is specified by the
> > > > user writing a path to a configfs attribute. This is the way it works
> > > > today but with blkdev_get_by_path()[1]. For the passthru code, we need
> > > > to get a nvme_ctrl instead of a block_device, but the principal is the same.
> > >
> > > Why isn't a fd being passed in there instead of a random string?
> >
> > I wouldn't know the answer to this but I assume because once we decided
> > to use configfs, there was no way for the user to pass the kernel an fd.
>
> That's definitely not changing. But this is not different than how we
> use the block device or file configuration, this just happen to need the
> nvme controller chardev now to issue I/O.

So, as was kind of alluded to in another part of the thread, what are
you doing about permissions? It seems that any user/group permissions
are out the window when you have the kernel itself do the opening of the
char device, right? Why is that ok? You can pass it _any_ character
device node and away it goes? What if you give it a "wrong" one? Char
devices are very different from block devices this way.

thanks,

greg k-h

2019-07-25 19:38:18

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>>>>> Why do you have a "string" within the kernel and are not using the
>>>>>> normal open() call from userspace on the character device node on the
>>>>>> filesystem in your namespace/mount/whatever?
>>>>>
>>>>> NVMe-OF is configured using configfs. The target is specified by the
>>>>> user writing a path to a configfs attribute. This is the way it works
>>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>>>
>>>> Why isn't a fd being passed in there instead of a random string?
>>>
>>> I wouldn't know the answer to this but I assume because once we decided
>>> to use configfs, there was no way for the user to pass the kernel an fd.
>>
>> That's definitely not changing. But this is not different than how we
>> use the block device or file configuration, this just happen to need the
>> nvme controller chardev now to issue I/O.
>
> So, as was kind of alluded to in another part of the thread, what are
> you doing about permissions? It seems that any user/group permissions
> are out the window when you have the kernel itself do the opening of the
> char device, right? Why is that ok? You can pass it _any_ character
> device node and away it goes? What if you give it a "wrong" one? Char
> devices are very different from block devices this way.

We could condition any configfs operation on capable(CAP_NET_ADMIN) to
close that hole for now..

2019-07-25 19:42:19

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 1:34 p.m., Greg Kroah-Hartman wrote:
> On Thu, Jul 25, 2019 at 12:02:30PM -0700, Sagi Grimberg wrote:
>>
>>>>>> Why do you have a "string" within the kernel and are not using the
>>>>>> normal open() call from userspace on the character device node on the
>>>>>> filesystem in your namespace/mount/whatever?
>>>>>
>>>>> NVMe-OF is configured using configfs. The target is specified by the
>>>>> user writing a path to a configfs attribute. This is the way it works
>>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>>>
>>>> Why isn't a fd being passed in there instead of a random string?
>>>
>>> I wouldn't know the answer to this but I assume because once we decided
>>> to use configfs, there was no way for the user to pass the kernel an fd.
>>
>> That's definitely not changing. But this is not different than how we
>> use the block device or file configuration, this just happen to need the
>> nvme controller chardev now to issue I/O.
>
> So, as was kind of alluded to in another part of the thread, what are
> you doing about permissions? It seems that any user/group permissions
> are out the window when you have the kernel itself do the opening of the
> char device, right? Why is that ok? You can pass it _any_ character
> device node and away it goes? What if you give it a "wrong" one? Char
> devices are very different from block devices this way.

Well the permission question is no different from the block-device case
we already have. The user has to be root to configure a target so it has
access to the block/char device. Containers and NVMe-of are really not
designed to mix and would take a lot of work to make this make any sense
(And that's way out of scope of what I'm trying to do here and doesn't
change the need for a the cdev_get_by_path()).

If the user specifies a non-nvme char device, it is rejected by the code
in nvme_ctrl_get_by_path() when it compares the fops. See patch 4.

Logan

2019-07-25 19:44:24

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 12:37:11PM -0700, Sagi Grimberg wrote:
>
> > > > > > > Why do you have a "string" within the kernel and are not using the
> > > > > > > normal open() call from userspace on the character device node on the
> > > > > > > filesystem in your namespace/mount/whatever?
> > > > > >
> > > > > > NVMe-OF is configured using configfs. The target is specified by the
> > > > > > user writing a path to a configfs attribute. This is the way it works
> > > > > > today but with blkdev_get_by_path()[1]. For the passthru code, we need
> > > > > > to get a nvme_ctrl instead of a block_device, but the principal is the same.
> > > > >
> > > > > Why isn't a fd being passed in there instead of a random string?
> > > >
> > > > I wouldn't know the answer to this but I assume because once we decided
> > > > to use configfs, there was no way for the user to pass the kernel an fd.
> > >
> > > That's definitely not changing. But this is not different than how we
> > > use the block device or file configuration, this just happen to need the
> > > nvme controller chardev now to issue I/O.
> >
> > So, as was kind of alluded to in another part of the thread, what are
> > you doing about permissions? It seems that any user/group permissions
> > are out the window when you have the kernel itself do the opening of the
> > char device, right? Why is that ok? You can pass it _any_ character
> > device node and away it goes? What if you give it a "wrong" one? Char
> > devices are very different from block devices this way.
>
> We could condition any configfs operation on capable(CAP_NET_ADMIN) to
> close that hole for now..

Why that specific permission?

And what about the "pass any random char device name" issue? What
happens if you pass /dev/random/ as the string?

thanks,

greg k-h

2019-07-25 19:45:26

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>> So, as was kind of alluded to in another part of the thread, what are
>> you doing about permissions?  It seems that any user/group permissions
>> are out the window when you have the kernel itself do the opening of the
>> char device, right?  Why is that ok?  You can pass it _any_ character
>> device node and away it goes?  What if you give it a "wrong" one?  Char
>> devices are very different from block devices this way.
>
> We could condition any configfs operation on capable(CAP_NET_ADMIN) to
> close that hole for now..

s/NET/SYS/...

2019-07-25 19:46:10

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>> So, as was kind of alluded to in another part of the thread, what are
>>> you doing about permissions? It seems that any user/group permissions
>>> are out the window when you have the kernel itself do the opening of the
>>> char device, right? Why is that ok? You can pass it _any_ character
>>> device node and away it goes? What if you give it a "wrong" one? Char
>>> devices are very different from block devices this way.
>>
>> We could condition any configfs operation on capable(CAP_NET_ADMIN) to
>> close that hole for now..
>
> Why that specific permission?

Meant CAP_SYS_ADMIN

> And what about the "pass any random char device name" issue? What
> happens if you pass /dev/random/ as the string?

What is the difference if the application is opening the device if
it has the wrong path?

2019-07-25 20:02:59

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()

On Thu, Jul 25, 2019 at 11:54:18AM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 11:50 a.m., Matthew Wilcox wrote:
> > On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
> >> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
> >> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
> >>
> >> The purpose of this function is to support NVMe-OF target passthru.
> >
> > I can't find anywhere that you use this in this patchset.
> >
>
> Oh sorry, the commit message is out of date the function was actually
> called nvme_ctrl_get_by_path() and it's used in Patch 10.

Instead of by path, could we have configfs take something else, like
the unique controller instance or serial number? I know that's different
than how we handle blocks and files, but that way nvme core can lookup
the cooresponding controller without adding new cdev dependencies.

2019-07-25 20:14:55

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()


>>>> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
>>>> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
>>>>
>>>> The purpose of this function is to support NVMe-OF target passthru.
>>>
>>> I can't find anywhere that you use this in this patchset.
>>>
>>
>> Oh sorry, the commit message is out of date the function was actually
>> called nvme_ctrl_get_by_path() and it's used in Patch 10.
>
> Instead of by path, could we have configfs take something else, like
> the unique controller instance or serial number? I know that's different
> than how we handle blocks and files, but that way nvme core can lookup
> the cooresponding controller without adding new cdev dependencies.

We could... but did we find sufficient justification to have the user
handle passthru devices differently than any other backend?
once we commit to an interface its very hard to change.


2019-07-25 20:29:31

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()



On 2019-07-25 1:58 p.m., Keith Busch wrote:
> On Thu, Jul 25, 2019 at 11:54:18AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-07-25 11:50 a.m., Matthew Wilcox wrote:
>>> On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
>>>> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
>>>> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
>>>>
>>>> The purpose of this function is to support NVMe-OF target passthru.
>>>
>>> I can't find anywhere that you use this in this patchset.
>>>
>>
>> Oh sorry, the commit message is out of date the function was actually
>> called nvme_ctrl_get_by_path() and it's used in Patch 10.
>
> Instead of by path, could we have configfs take something else, like
> the unique controller instance or serial number? I know that's different
> than how we handle blocks and files, but that way nvme core can lookup
> the cooresponding controller without adding new cdev dependencies.

Well the previous version of the patchset just used the ctrl name
("nvme1") and looped through all the controllers to find a match. But
this sucks because of the inconsistency and the fact that the name can
change if hardware changes and the number changes. Allowing the user to
make use of standard udev rules seems important to me.

Logan


2019-07-25 20:35:17

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()

On Thu, Jul 25, 2019 at 02:28:28PM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 1:58 p.m., Keith Busch wrote:
> > On Thu, Jul 25, 2019 at 11:54:18AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-07-25 11:50 a.m., Matthew Wilcox wrote:
> >>> On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
> >>>> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
> >>>> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
> >>>>
> >>>> The purpose of this function is to support NVMe-OF target passthru.
> >>>
> >>> I can't find anywhere that you use this in this patchset.
> >>>
> >>
> >> Oh sorry, the commit message is out of date the function was actually
> >> called nvme_ctrl_get_by_path() and it's used in Patch 10.
> >
> > Instead of by path, could we have configfs take something else, like
> > the unique controller instance or serial number? I know that's different
> > than how we handle blocks and files, but that way nvme core can lookup
> > the cooresponding controller without adding new cdev dependencies.
>
> Well the previous version of the patchset just used the ctrl name
> ("nvme1") and looped through all the controllers to find a match. But
> this sucks because of the inconsistency and the fact that the name can
> change if hardware changes and the number changes. Allowing the user to
> make use of standard udev rules seems important to me.

Should we then create a new udev rule for persistent controller
names? /dev/nvme1 may not be the same controller each time you refer
to it.

2019-07-25 20:38:54

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] nvme-core: introduce nvme_get_by_path()



On 2019-07-25 2:31 p.m., Keith Busch wrote:
> On Thu, Jul 25, 2019 at 02:28:28PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-07-25 1:58 p.m., Keith Busch wrote:
>>> On Thu, Jul 25, 2019 at 11:54:18AM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>> On 2019-07-25 11:50 a.m., Matthew Wilcox wrote:
>>>>> On Thu, Jul 25, 2019 at 11:23:23AM -0600, Logan Gunthorpe wrote:
>>>>>> nvme_get_by_path() is analagous to blkdev_get_by_path() except it
>>>>>> gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
>>>>>>
>>>>>> The purpose of this function is to support NVMe-OF target passthru.
>>>>>
>>>>> I can't find anywhere that you use this in this patchset.
>>>>>
>>>>
>>>> Oh sorry, the commit message is out of date the function was actually
>>>> called nvme_ctrl_get_by_path() and it's used in Patch 10.
>>>
>>> Instead of by path, could we have configfs take something else, like
>>> the unique controller instance or serial number? I know that's different
>>> than how we handle blocks and files, but that way nvme core can lookup
>>> the cooresponding controller without adding new cdev dependencies.
>>
>> Well the previous version of the patchset just used the ctrl name
>> ("nvme1") and looped through all the controllers to find a match. But
>> this sucks because of the inconsistency and the fact that the name can
>> change if hardware changes and the number changes. Allowing the user to
>> make use of standard udev rules seems important to me.
>
> Should we then create a new udev rule for persistent controller
> names? /dev/nvme1 may not be the same controller each time you refer
> to it.

Udev can only create symlinks from /dev/nvme0 to
/dev/nvme-persistent-name and users can do this as they need now. No
changes needed.

My point was if we use the ctrl name (nvme0) as a reference then the
kernel can't make use of these symlinks or anything udev does seeing
that name is internal to the kernel only.

If we use cdev_get_by_path()/nvme_ctrl_get_by_path() then this isn't a
problem as we can open a symlink to /dev/nvme0 without any issues.

Logan


2019-07-25 23:56:45

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 01:24:22PM -0600, Logan Gunthorpe wrote:
>
>
> On 2019-07-25 1:11 p.m., Matthew Wilcox wrote:
> > On Thu, Jul 25, 2019 at 12:05:29PM -0700, Sagi Grimberg wrote:
> >>
> >>>>> NVMe-OF is configured using configfs. The target is specified by the
> >>>>> user writing a path to a configfs attribute. This is the way it works
> >>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
> >>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
> >>>>
> >>>> Why isn't a fd being passed in there instead of a random string?
> >>>
> >>> I suppose we could echo a string of the file descriptor number there,
> >>> and look up the fd in the process' file descriptor table ...
> >>
> >> Assuming that there is a open handle somewhere out there...
>
> Yes, that would be a step backwards from an interface. The user would
> then need a special process to open the fd and pass it through configfs.
> They couldn't just do it with basic bash commands.

First of all, they can, but... WTF not have filp_open() done right there?
Yes, by name. With permission checks done. And pick your object from the
sodding struct file you'll get.

What's the problem? Why do you need cdev lookups, etc., when you are
dealing with files under your full control? Just open them and use
->private_data or whatever you set in ->open() to access the damn thing.
All there is to it...

2019-07-26 04:31:40

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()


>>>>>>> NVMe-OF is configured using configfs. The target is specified by the
>>>>>>> user writing a path to a configfs attribute. This is the way it works
>>>>>>> today but with blkdev_get_by_path()[1]. For the passthru code, we need
>>>>>>> to get a nvme_ctrl instead of a block_device, but the principal is the same.
>>>>>>
>>>>>> Why isn't a fd being passed in there instead of a random string?
>>>>>
>>>>> I suppose we could echo a string of the file descriptor number there,
>>>>> and look up the fd in the process' file descriptor table ...
>>>>
>>>> Assuming that there is a open handle somewhere out there...
>>
>> Yes, that would be a step backwards from an interface. The user would
>> then need a special process to open the fd and pass it through configfs.
>> They couldn't just do it with basic bash commands.
>
> First of all, they can, but... WTF not have filp_open() done right there?
> Yes, by name. With permission checks done. And pick your object from the
> sodding struct file you'll get.
>
> What's the problem? Why do you need cdev lookups, etc., when you are
> dealing with files under your full control? Just open them and use
> ->private_data or whatever you set in ->open() to access the damn thing.
> All there is to it...
Oh this is so much simpler. There is really no point in using anything
else. Just need to remember to compare f->f_op to what we expect to make
sure that it is indeed the same device class.

2019-07-26 06:25:59

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support

Hi Logan,

On 7/25/19 7:23 PM, Logan Gunthorpe wrote:
> Hi,
>
> Chaitainya has asked us to take on these patches as we have an
> interest in getting them into upstream. To that end, we've done
> a large amount of testing, bug fixes and cleanup.
>
> Passthru support for nvmet allows users to export an entire
> NVMe controller through NVMe-oF. When exported in this way (as opposed
> to exporting each namespace as a block device), all the NVMe commands
> are passed to the given controller unmodified, including most admin
> commands and Vendor Unique Commands (VUCs). A passthru target will
> expose all namespaces for a given device to the remote host.
>
In general I'm very much in favour of this, yet there are some issues
which I'm not quite clear about.

> There are three major non-bugfix changes that we've done to the series:
>
> 1) Instead of using a seperate special passthru subsystem in
> configfs simply add a passthru directory that's analogous to
> the existing namespace directories. The directories have
> very similar attributes to namespaces but are mutually exclusive.
> If a user enables a namespaces, they can't then enable
> passthru controller and vice versa. This simplifies the code
> required to implement passthru configfs and IMO creates a much
> clearer and uniform interface.
>
How do you handle subsystem naming?
If you enable the 'passthru' device, the (nvmet) subsystem (and its
name) is already created. Yet the passthru device will have its own
internal subsystem naming, so if you're not extra careful you'll end up
with a nvmet subsystem which doesn't have any relationship with the
passthru subsystem, making addressing etc ... tricky.
Any thoughts about that?

Similarly: how do you propose to handle multipath devices?
Any NVMe with several paths will be enabling NVMe multipathing
automatically, presenting you with a single multipathed namespace.
How will these devices be treated?
Will the multipathed namespace be used for passthru?

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

2019-07-26 07:14:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()

On Thu, Jul 25, 2019 at 09:29:40PM -0700, Sagi Grimberg wrote:
>
> > > > > > > > NVMe-OF is configured using configfs. The target is specified by the
> > > > > > > > user writing a path to a configfs attribute. This is the way it works
> > > > > > > > today but with blkdev_get_by_path()[1]. For the passthru code, we need
> > > > > > > > to get a nvme_ctrl instead of a block_device, but the principal is the same.
> > > > > > >
> > > > > > > Why isn't a fd being passed in there instead of a random string?
> > > > > >
> > > > > > I suppose we could echo a string of the file descriptor number there,
> > > > > > and look up the fd in the process' file descriptor table ...
> > > > >
> > > > > Assuming that there is a open handle somewhere out there...
> > >
> > > Yes, that would be a step backwards from an interface. The user would
> > > then need a special process to open the fd and pass it through configfs.
> > > They couldn't just do it with basic bash commands.
> >
> > First of all, they can, but... WTF not have filp_open() done right there?
> > Yes, by name. With permission checks done. And pick your object from the
> > sodding struct file you'll get.
> >
> > What's the problem? Why do you need cdev lookups, etc., when you are
> > dealing with files under your full control? Just open them and use
> > ->private_data or whatever you set in ->open() to access the damn thing.
> > All there is to it...
> Oh this is so much simpler. There is really no point in using anything
> else. Just need to remember to compare f->f_op to what we expect to make
> sure that it is indeed the same device class.

That is good, that solves the "/dev/random/" issue I was talking about
earlier as well.

Odds are you all can do the same for the blockdev interface as well.

thanks,

greg k-h

2019-07-26 15:47:58

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 02/16] chardev: introduce cdev_get_by_path()



On 2019-07-25 10:29 p.m., Sagi Grimberg wrote:
>
>>>>>>>> NVMe-OF is configured using configfs. The target is specified by
>>>>>>>> the
>>>>>>>> user writing a path to a configfs attribute. This is the way it
>>>>>>>> works
>>>>>>>> today but with blkdev_get_by_path()[1]. For the passthru code,
>>>>>>>> we need
>>>>>>>> to get a nvme_ctrl instead of a block_device, but the principal
>>>>>>>> is the same.
>>>>>>>
>>>>>>> Why isn't a fd being passed in there instead of a random string?
>>>>>>
>>>>>> I suppose we could echo a string of the file descriptor number there,
>>>>>> and look up the fd in the process' file descriptor table ...
>>>>>
>>>>> Assuming that there is a open handle somewhere out there...
>>>
>>> Yes, that would be a step backwards from an interface. The user would
>>> then need a special process to open the fd and pass it through configfs.
>>> They couldn't just do it with basic bash commands.
>>
>> First of all, they can, but... WTF not have filp_open() done right there?
>> Yes, by name.  With permission checks done.  And pick your object from
>> the
>> sodding struct file you'll get.
>>
>> What's the problem?  Why do you need cdev lookups, etc., when you are
>> dealing with files under your full control?  Just open them and use
>> ->private_data or whatever you set in ->open() to access the damn thing.
>> All there is to it...
> Oh this is so much simpler. There is really no point in using anything
> else. Just need to remember to compare f->f_op to what we expect to make
> sure that it is indeed the same device class.

Yes, that sounds like a good idea. I'll do this for v2.

Thanks,

Logan

2019-07-26 21:48:56

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support



On 2019-07-26 12:23 a.m., Hannes Reinecke wrote:
> How do you handle subsystem naming?
> If you enable the 'passthru' device, the (nvmet) subsystem (and its
> name) is already created. Yet the passthru device will have its own
> internal subsystem naming, so if you're not extra careful you'll end up
> with a nvmet subsystem which doesn't have any relationship with the
> passthru subsystem, making addressing etc ... tricky.
> Any thoughts about that?

Well I can't say I have a great understanding of how multipath works, but...

I don't think it necessarily makes sense for the target subsynqn and the
target's device nqn to be the same. It would be weird for a user to want
to use the same device and a passed through device (through a loop) as
part of the same subsystem. That being said, it's possible for the user
to use the subsysnqn from the passed through device for the name of the
subsys of the target. I tried this and it works except for the fact that
the device I'm passing through doesn't set id->cmic.

> Similarly: how do you propose to handle multipath devices?
> Any NVMe with several paths will be enabling NVMe multipathing
> automatically, presenting you with a single multipathed namespace.
> How will these devices be treated?

Well passthru works on the controller level not on the namespace level.
So it can't make use of the multipath handling on the target system.

The one case that I think makes sense to me, but I don't know how if we
can handle, is if the user had a couple multipath enabled controllers
with the same subsynqn and wanted to passthru all of them to another
system and use multipath on the host with both controllers. This would
require having multiple target subsystems with the same name which I
don't think will work too well.

> Will the multipathed namespace be used for passthru?

Nope.

Honestly, I think the answer is if someone wants to use multipathed
controllers they should use regular NVMe-of as it doesn't really mesh
well with the passthru approach.

Logan

2019-07-26 23:18:37

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support


>> How do you handle subsystem naming?
>> If you enable the 'passthru' device, the (nvmet) subsystem (and its
>> name) is already created. Yet the passthru device will have its own
>> internal subsystem naming, so if you're not extra careful you'll end up
>> with a nvmet subsystem which doesn't have any relationship with the
>> passthru subsystem, making addressing etc ... tricky.
>> Any thoughts about that?
>
> Well I can't say I have a great understanding of how multipath works, but...

Why is this related to multipath?

> I don't think it necessarily makes sense for the target subsynqn and the
> target's device nqn to be the same. It would be weird for a user to want
> to use the same device and a passed through device (through a loop) as
> part of the same subsystem. That being said, it's possible for the user
> to use the subsysnqn from the passed through device for the name of the
> subsys of the target. I tried this and it works except for the fact that
> the device I'm passing through doesn't set id->cmic.

I don't see why should the subsystem nqn should be the same name. Its
just like any other nvmet subsystem, just happens to have a nvme
controller in the backend (which it knows about). No reason to have
the same name IMO.

>> Similarly: how do you propose to handle multipath devices?
>> Any NVMe with several paths will be enabling NVMe multipathing
>> automatically, presenting you with a single multipathed namespace.
>> How will these devices be treated?
>
> Well passthru works on the controller level not on the namespace level.
> So it can't make use of the multipath handling on the target system.

Why? if nvmet is capable, why shouldn't we support it?

> The one case that I think makes sense to me, but I don't know how if we
> can handle, is if the user had a couple multipath enabled controllers
> with the same subsynqn

That is usually the case, there is no multipathing defined across NVM
subsystems (at least for now).

> and wanted to passthru all of them to another
> system and use multipath on the host with both controllers. This would
> require having multiple target subsystems with the same name which I
> don't think will work too well.

Don't understand why this is the case?

AFAICT, all nvmet needs to do is:
1. override cimc
2. allow allocating multiple controllers to the pt ctrl as long as the
hostnqn match.
3. answer all the ana stuff.

What else is missing?

>> Will the multipathed namespace be used for passthru?
>
> Nope.
>
> Honestly, I think the answer is if someone wants to use multipathed
> controllers they should use regular NVMe-of as it doesn't really mesh
> well with the passthru approach.

Maybe I'm missing something, but they should be orthogonal.. I know that
its sort of not real passthru, but we are exposing an nvme device across
a fabric, I think its reasonable to have some adjustments on top.

2019-07-26 23:24:33

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support


>> Why? if nvmet is capable, why shouldn't we support it?
>
> I'm saying that passthru is exporting a specific controller and submits
> commands (both admin and IO) straight to the nvme_ctrl's queues. It's
> not exporting an nvme_subsys and I think it would be troublesome to do
> so; for example, if the target receives an admin command which ctrl of
> the subsystem should it send it to?

Its the same controller in the backend, what is the difference from
which fabrics controller the admin command came from?

> There's also no userspace handle for
> a given subsystem we'd maybe have to use the subsysnqn.

Umm, not sure I understand what you mean.

>>> The one case that I think makes sense to me, but I don't know how if we
>>> can handle, is if the user had a couple multipath enabled controllers
>>> with the same subsynqn
>>
>> That is usually the case, there is no multipathing defined across NVM
>> subsystems (at least for now).
>>
>>> and wanted to passthru all of them to another
>>> system and use multipath on the host with both controllers. This would
>>> require having multiple target subsystems with the same name which I
>>> don't think will work too well.
>>
>> Don't understand why this is the case?
>>
>> AFAICT, all nvmet needs to do is:
>> 1. override cimc
>> 2. allow allocating multiple controllers to the pt ctrl as long as the
>> hostnqn match.
>> 3. answer all the ana stuff.
>
> But with this scheme the host will only see one controller and then the
> target would have to make decisions on which ctrl to send any commands
> to. Maybe it could be done for I/O but I don't see how it could be done
> correctly for admin commands.

I haven't thought this through so its very possible that I'm missing
something, but why can't the host see multiple controllers if it has
more than one path to the target?

What specific admin commands are you concerned about? What exactly
would clash?

> And from the hosts perspective, having cimc set doesn't help anything
> because we've limited the passthru code to only accept one connection
> from one host so the host can only actually have one route to this
> controller.

And I'm suggesting to allow more than a single controller given that all
controller allocations match a single hostnqn. It wouldn't make sense to
expose this controller to multiple hosts (although that might be doable
but but definitely requires non-trivial infrastructure around it).

Look, when it comes to fabrics, multipath is a fundamental piece of the
puzzle. Not supporting multipathing significantly diminishes the value
of this in my mind (assuming this answers a real-world use-case).

2019-07-27 00:11:51

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support



On 2019-07-26 4:21 p.m., Sagi Grimberg wrote:
>> I don't think it necessarily makes sense for the target subsynqn and the
>> target's device nqn to be the same. It would be weird for a user to want
>> to use the same device and a passed through device (through a loop) as
>> part of the same subsystem. That being said, it's possible for the user
>> to use the subsysnqn from the passed through device for the name of the
>> subsys of the target. I tried this and it works except for the fact that
>> the device I'm passing through doesn't set id->cmic.
>
> I don't see why should the subsystem nqn should be the same name. Its
> just like any other nvmet subsystem, just happens to have a nvme
> controller in the backend (which it knows about). No reason to have
> the same name IMO.

Agreed.

>>> Similarly: how do you propose to handle multipath devices?
>>> Any NVMe with several paths will be enabling NVMe multipathing
>>> automatically, presenting you with a single multipathed namespace.
>>> How will these devices be treated?
>>
>> Well passthru works on the controller level not on the namespace level.
>> So it can't make use of the multipath handling on the target system.
>
> Why? if nvmet is capable, why shouldn't we support it?

I'm saying that passthru is exporting a specific controller and submits
commands (both admin and IO) straight to the nvme_ctrl's queues. It's
not exporting an nvme_subsys and I think it would be troublesome to do
so; for example, if the target receives an admin command which ctrl of
the subsystem should it send it to? There's also no userspace handle for
a given subsystem we'd maybe have to use the subsysnqn.

>> The one case that I think makes sense to me, but I don't know how if we
>> can handle, is if the user had a couple multipath enabled controllers
>> with the same subsynqn
>
> That is usually the case, there is no multipathing defined across NVM
> subsystems (at least for now).
>
>> and wanted to passthru all of them to another
>> system and use multipath on the host with both controllers. This would
>> require having multiple target subsystems with the same name which I
>> don't think will work too well.
>
> Don't understand why this is the case?
>
> AFAICT, all nvmet needs to do is:
> 1. override cimc
> 2. allow allocating multiple controllers to the pt ctrl as long as the
> hostnqn match.
> 3. answer all the ana stuff.

But with this scheme the host will only see one controller and then the
target would have to make decisions on which ctrl to send any commands
to. Maybe it could be done for I/O but I don't see how it could be done
correctly for admin commands.

And from the hosts perspective, having cimc set doesn't help anything
because we've limited the passthru code to only accept one connection
from one host so the host can only actually have one route to this
controller.

Logan

2019-07-27 01:38:39

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support



On 2019-07-26 5:13 p.m., Sagi Grimberg wrote:
>
>>> Why? if nvmet is capable, why shouldn't we support it?
>>
>> I'm saying that passthru is exporting a specific controller and submits
>> commands (both admin and IO) straight to the nvme_ctrl's queues. It's
>> not exporting an nvme_subsys and I think it would be troublesome to do
>> so; for example, if the target receives an admin command which ctrl of
>> the subsystem should it send it to?
>
> Its the same controller in the backend, what is the difference from
> which fabrics controller the admin command came from?

This is not my understanding. It's not really the same controller in the
back end and there are admin commands that operate on the controller
like the namespace attachment command which takes a list of cntlids
(though admittedly is not something I'm too familiar with because I
don't have any hardware to play around with). Though that command is
already a bit problematic for passthru because we have different cntlid
address spaces.

> I haven't thought this through so its very possible that I'm missing
> something, but why can't the host see multiple controllers if it has
> more than one path to the target?

Well a target controller is created for each connection. So if the host
wanted to see multiple controllers it would have to do multiple "nvme
connects" and some how need to address the individual controllers for
each connection. Right now a connect is based on subsysnqn which would
be the same for every multipath controller.

> What specific admin commands are you concerned about? What exactly
> would clash?

Namespace attach comes to mind.

> And I'm suggesting to allow more than a single controller given that all
> controller allocations match a single hostnqn. It wouldn't make sense to
> expose this controller to multiple hosts (although that might be doable
> but but definitely requires non-trivial infrastructure around it).

> Look, when it comes to fabrics, multipath is a fundamental piece of the
> puzzle. Not supporting multipathing significantly diminishes the value
> of this in my mind (assuming this answers a real-world use-case).

I'd agree with that. But it's the multipath through different ports that
seems important for fabrics. ie. If I have a host with a path through
RDMA and a path through TCP they should both work and allow fail over.
This is quite orthogonal to passthru and would be easily supported if we
dropped the multiple hosts restriction (I'm not sure what the objection
really is to that).

This is different from multipath on say a multi-controller PCI device
and trying to expose both those controllers through passthru. this is
where the problems we are discussing come up. Supporting this is what is
hard and I think the sensible answer is if users want to do something
like that, they use non-passthru NVME-of and the multipath code will
just work as designed.

Our real-world use case is to support our PCI device which has a bunch
of vendor unique commands and isn't likely to support multiple
controllers in the foreseeable future.

Logan

2019-07-27 01:41:44

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support

> This is different from multipath on say a multi-controller PCI device
> and trying to expose both those controllers through passthru. this is
> where the problems we are discussing come up.

I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).

> But it's the multipath through different ports that
> seems important for fabrics. ie. If I have a host with a path through
> RDMA and a path through TCP they should both work and allow fail over.

Yes, or even two paths that are both RDMA or both TCP but which take a different path through the network from host to target.

> Our real-world use case is to support our PCI device which has a bunch
> of vendor unique commands and isn't likely to support multiple
> controllers in the foreseeable future.

I think solving passthru for single-port PCIe controllers would be a good start.

Stephen

2019-07-29 16:43:59

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support


>> This is different from multipath on say a multi-controller PCI device
>> and trying to expose both those controllers through passthru. this is
>> where the problems we are discussing come up.
>
> I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).

Yes, I was referring to fabrics multipathing which is somewhat
orthogonal to the backend pci multipathing (unless I'm missing
something).

>> But it's the multipath through different ports that
>> seems important for fabrics. ie. If I have a host with a path through
>> RDMA and a path through TCP they should both work and allow fail over.
>
> Yes, or even two paths that are both RDMA or both TCP but which take a different path through the network from host to target.
>
>> Our real-world use case is to support our PCI device which has a bunch
>> of vendor unique commands and isn't likely to support multiple
>> controllers in the foreseeable future.
>
> I think solving passthru for single-port PCIe controllers would be a good start.

Me too.

2019-07-29 16:46:46

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 00/16] nvmet: add target passthru commands support



On 2019-07-29 10:15 a.m., Sagi Grimberg wrote:
>
>>> This is different from multipath on say a multi-controller PCI device
>>> and trying to expose both those controllers through passthru. this is
>>> where the problems we are discussing come up.
>>
>> I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).
>
> Yes, I was referring to fabrics multipathing which is somewhat
> orthogonal to the backend pci multipathing (unless I'm missing
> something).

Yes, so if we focus on the fabrics multipathing, the only issue I see is
that only one controller can be connected to a passthru target (I
believe this was at your request) so two paths simply cannot exist to
begin with. I can easily remove that restriction.

Logan