Hi,
This patch series aims to provide more fine-grained control over
nvme's native multipathing, by allowing it to be switched on and off
on a per-subsystem basis instead of a big global switch.
The prime use-case is for mixed scenarios where a user might want to use
nvme's native multipathing on one subset of subsystems and
dm-multipath on another subset.
For example, using native multipathing for the internal PCIe NVMe devices
and dm-mpath for the connection to an NVMe over Fabrics array.
The initial discussion for this was held at this year's LSF/MM and the
architecture hasn't changed from what we discussed there.
The first patch implements said switch and Mike added two follow-up
patches to make the personality attribute accessible from the block
device's sysfs directory as well.
I do have a blktests test for it as well, but due to the fcloop bug I
reported I'm reluctant to include it in this series (or I would need to
uncomment the rmmods).
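For reference, here is a minimal userspace sketch (illustrative only, not
part of the series) that queries the current personality through the
per-device attribute added in patch 3; the device name is a placeholder
and the value format follows mpath_personality_show():

/*
 * Illustrative sketch only: read the multipath personality via the
 * block device attribute from patch 3.  "nvme0n1" is a placeholder.
 */
#include <stdio.h>

int main(void)
{
        char value[32] = "";
        FILE *f = fopen("/sys/block/nvme0n1/device/mpath_personality", "r");

        if (!f) {
                perror("mpath_personality");
                return 1;
        }

        /* The show handler prints "[native] other" or "native [other]" */
        if (fgets(value, sizeof(value), f))
                printf("current personality: %s", value);

        fclose(f);
        return 0;
}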
Johannes Thumshirn (1):
nvme: provide a way to disable nvme mpath per subsystem
Mike Snitzer (2):
nvme multipath: added SUBSYS_ATTR_RW
nvme multipath: add dev_attr_mpath_personality
drivers/nvme/host/core.c | 112 ++++++++++++++++++++++++++++++++++++++++--
drivers/nvme/host/multipath.c | 34 +++++++++++--
drivers/nvme/host/nvme.h | 8 +++
3 files changed, 144 insertions(+), 10 deletions(-)
--
2.16.3
From: Mike Snitzer <[email protected]>
Prep for adding dev_attr in addition to subsys_attr for 'mpath_personality'.
Signed-off-by: Mike Snitzer <[email protected]>
---
drivers/nvme/host/core.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3a1c70bd9008..7105980dde3f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2082,13 +2082,15 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
return NULL;
}
-#define SUBSYS_ATTR_RW(_name) \
- struct device_attribute subsys_attr_##_name = \
- __ATTR_RW(_name)
+#define SUBSYS_ATTR(_name, _mode, _show, _store) \
+ struct device_attribute subsys_attr_##_name = \
+ __ATTR(_name, _mode, _show, _store)
-#define SUBSYS_ATTR_RO(_name, _mode, _show) \
- struct device_attribute subsys_attr_##_name = \
- __ATTR(_name, _mode, _show, NULL)
+#define SUBSYS_ATTR_RO(_name, _show) \
+ SUBSYS_ATTR(_name, 0444, _show, NULL)
+
+#define SUBSYS_ATTR_RW(_name, _show, _store) \
+ SUBSYS_ATTR(_name, 0644, _show, _store)
static ssize_t nvme_subsys_show_nqn(struct device *dev,
struct device_attribute *attr,
@@ -2099,7 +2101,7 @@ static ssize_t nvme_subsys_show_nqn(struct device *dev,
return snprintf(buf, PAGE_SIZE, "%s\n", subsys->subnqn);
}
-static SUBSYS_ATTR_RO(subsysnqn, S_IRUGO, nvme_subsys_show_nqn);
+static SUBSYS_ATTR_RO(subsysnqn, nvme_subsys_show_nqn);
#define nvme_subsys_show_str_function(field) \
static ssize_t subsys_##field##_show(struct device *dev, \
@@ -2110,17 +2112,16 @@ static ssize_t subsys_##field##_show(struct device *dev, \
return sprintf(buf, "%.*s\n", \
(int)sizeof(subsys->field), subsys->field); \
} \
-static SUBSYS_ATTR_RO(field, S_IRUGO, subsys_##field##_show);
+static SUBSYS_ATTR_RO(field, subsys_##field##_show);
nvme_subsys_show_str_function(model);
nvme_subsys_show_str_function(serial);
nvme_subsys_show_str_function(firmware_rev);
-
#ifdef CONFIG_NVME_MULTIPATH
-static ssize_t mpath_personality_show(struct device *dev,
- struct device_attribute *attr,
- char *buf)
+static ssize_t nvme_subsys_show_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
{
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
@@ -2134,9 +2135,9 @@ static ssize_t mpath_personality_show(struct device *dev,
return ret;
}
-static ssize_t mpath_personality_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
+static ssize_t nvme_subsys_store_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
{
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
@@ -2161,7 +2162,8 @@ static ssize_t mpath_personality_store(struct device *dev,
out:
return ret ? ret : count;
}
-static SUBSYS_ATTR_RW(mpath_personality);
+static SUBSYS_ATTR_RW(mpath_personality, nvme_subsys_show_mpath_personality,
+ nvme_subsys_store_mpath_personality);
#endif
static struct attribute *nvme_subsys_attrs[] = {
--
2.16.3
Provide a way to disable NVMe native multipathing on a per-subsystem
basis to enable a user to use dm-mpath and nvme native multipathing on
the same host for different nvme devices.
Signed-off-by: Johannes Thumshirn <[email protected]>
---
drivers/nvme/host/core.c | 63 +++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 34 +++++++++++++++++++----
drivers/nvme/host/nvme.h | 8 ++++++
3 files changed, 100 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99b857e5a7a9..3a1c70bd9008 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2082,6 +2082,10 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
return NULL;
}
+#define SUBSYS_ATTR_RW(_name) \
+ struct device_attribute subsys_attr_##_name = \
+ __ATTR_RW(_name)
+
#define SUBSYS_ATTR_RO(_name, _mode, _show) \
struct device_attribute subsys_attr_##_name = \
__ATTR(_name, _mode, _show, NULL)
@@ -2112,11 +2116,62 @@ nvme_subsys_show_str_function(model);
nvme_subsys_show_str_function(serial);
nvme_subsys_show_str_function(firmware_rev);
+
+#ifdef CONFIG_NVME_MULTIPATH
+static ssize_t mpath_personality_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+ ssize_t ret;
+
+ if (subsys->native_mpath)
+ ret = scnprintf(buf, PAGE_SIZE, "[native] other\n");
+ else
+ ret = scnprintf(buf, PAGE_SIZE, "native [other]\n");
+
+ return ret;
+}
+
+static ssize_t mpath_personality_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+ bool native_mpath = false;
+ int ret = 0;
+
+ if (!strncmp(buf, "native", strlen("native")))
+ native_mpath = true;
+ else if (!strncmp(buf, "other", strlen("other")))
+ native_mpath = false;
+ else {
+ pr_warn("unknown value %s\n", buf);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (subsys->native_mpath != native_mpath) {
+ subsys->native_mpath = native_mpath;
+ ret = nvme_mpath_change_personality(subsys);
+ }
+
+out:
+ return ret ? ret : count;
+}
+static SUBSYS_ATTR_RW(mpath_personality);
+#endif
+
static struct attribute *nvme_subsys_attrs[] = {
&subsys_attr_model.attr,
&subsys_attr_serial.attr,
&subsys_attr_firmware_rev.attr,
&subsys_attr_subsysnqn.attr,
+#ifdef CONFIG_NVME_MULTIPATH
+ &subsys_attr_mpath_personality.attr,
+#endif
NULL,
};
@@ -2220,6 +2275,10 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
mutex_unlock(&subsys->lock);
+#ifdef CONFIG_NVME_MULTIPATH
+ subsys->native_mpath = nvme_multipath;
+#endif
+
return 0;
out_unlock:
@@ -2850,6 +2909,10 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
head->ns_id = nsid;
kref_init(&head->ref);
+#ifdef CONFIG_NVME_MULTIPATH
+ head->native_mpath = ctrl->subsys->native_mpath;
+#endif
+
nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
ret = __nvme_check_ids(ctrl->subsys, head);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d7b664ae5923..53d2610605ca 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -14,8 +14,8 @@
#include <linux/moduleparam.h>
#include "nvme.h"
-static bool multipath = true;
-module_param(multipath, bool, 0444);
+bool nvme_multipath = true;
+module_param_named(multipath, nvme_multipath, bool, 0444);
MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
@@ -29,7 +29,7 @@ MODULE_PARM_DESC(multipath,
void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
struct nvme_ctrl *ctrl, int *flags)
{
- if (!multipath) {
+ if (!ctrl->subsys->native_mpath) {
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance);
} else if (ns->head->disk) {
sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
@@ -181,7 +181,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
* We also do this for private namespaces as the namespace sharing data could
* change after a rescan.
*/
- if (!(ctrl->subsys->cmic & (1 << 1)) || !multipath)
+ if (!(ctrl->subsys->cmic & (1 << 1)) || !ctrl->subsys->native_mpath)
return 0;
q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
@@ -218,7 +218,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
void nvme_mpath_add_disk(struct nvme_ns_head *head)
{
- if (!head->disk)
+ if (!head->disk || !head->native_mpath)
return;
mutex_lock(&head->subsys->lock);
@@ -246,3 +246,27 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
blk_cleanup_queue(head->disk->queue);
put_disk(head->disk);
}
+
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
+{
+ struct nvme_ctrl *ctrl;
+ int ret = 0;
+
+restart:
+ mutex_lock(&subsys->lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (!list_empty(&ctrl->namespaces)) {
+ mutex_unlock(&subsys->lock);
+ nvme_remove_namespaces(ctrl);
+ goto restart;
+ }
+ }
+ mutex_unlock(&subsys->lock);
+
+ mutex_lock(&subsys->lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ nvme_queue_scan(ctrl);
+ mutex_unlock(&subsys->lock);
+
+ return ret;
+}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 17d2f7cf3fed..f7253c074a89 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -32,6 +32,8 @@ extern unsigned int admin_timeout;
#define NVME_DEFAULT_KATO 5
#define NVME_KATO_GRACE 10
+extern bool nvme_multipath;
+
extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;
@@ -232,6 +234,10 @@ struct nvme_subsystem {
u8 cmic;
u16 vendor_id;
struct ida ns_ida;
+
+#ifdef CONFIG_NVME_MULTIPATH
+ bool native_mpath;
+#endif
};
/*
@@ -257,6 +263,7 @@ struct nvme_ns_head {
struct bio_list requeue_list;
spinlock_t requeue_lock;
struct work_struct requeue_work;
+ bool native_mpath;
#endif
struct list_head list;
struct srcu_struct srcu;
@@ -449,6 +456,7 @@ void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys);
static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
--
2.16.3
From: Mike Snitzer <[email protected]>
Allows the ability to consistently access 'mpath_personality' regardless
of which mode we're in (native vs other), using:
/sys/block/nvmeXn1/device/mpath_personality
Signed-off-by: Mike Snitzer <[email protected]>
---
drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7105980dde3f..e953712086ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2119,12 +2119,9 @@ nvme_subsys_show_str_function(serial);
nvme_subsys_show_str_function(firmware_rev);
#ifdef CONFIG_NVME_MULTIPATH
-static ssize_t nvme_subsys_show_mpath_personality(struct device *dev,
- struct device_attribute *attr,
- char *buf)
+static ssize_t __nvme_subsys_show_mpath_personality(struct nvme_subsystem *subsys,
+ char *buf)
{
- struct nvme_subsystem *subsys =
- container_of(dev, struct nvme_subsystem, dev);
ssize_t ret;
if (subsys->native_mpath)
@@ -2135,12 +2132,9 @@ static ssize_t nvme_subsys_show_mpath_personality(struct device *dev,
return ret;
}
-static ssize_t nvme_subsys_store_mpath_personality(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
+static ssize_t __nvme_subsys_store_mpath_personality(struct nvme_subsystem *subsys,
+ const char *buf, size_t count)
{
- struct nvme_subsystem *subsys =
- container_of(dev, struct nvme_subsystem, dev);
bool native_mpath = false;
int ret = 0;
@@ -2162,6 +2156,24 @@ static ssize_t nvme_subsys_store_mpath_personality(struct device *dev,
out:
return ret ? ret : count;
}
+
+static ssize_t nvme_subsys_show_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+ return __nvme_subsys_show_mpath_personality(subsys, buf);
+}
+
+static ssize_t nvme_subsys_store_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+ return __nvme_subsys_store_mpath_personality(subsys, buf, count);
+}
static SUBSYS_ATTR_RW(mpath_personality, nvme_subsys_show_mpath_personality,
nvme_subsys_store_mpath_personality);
#endif
@@ -2819,6 +2831,28 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
}
static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
+#ifdef CONFIG_NVME_MULTIPATH
+static ssize_t nvme_sysfs_show_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return __nvme_subsys_show_mpath_personality(ctrl->subsys, buf);
+}
+
+static ssize_t nvme_sysfs_store_mpath_personality(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return __nvme_subsys_store_mpath_personality(ctrl->subsys, buf, count);
+}
+static DEVICE_ATTR(mpath_personality, 0644,
+ nvme_sysfs_show_mpath_personality, nvme_sysfs_store_mpath_personality);
+#endif
+
static struct attribute *nvme_dev_attrs[] = {
&dev_attr_reset_controller.attr,
&dev_attr_rescan_controller.attr,
@@ -2831,6 +2865,9 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_subsysnqn.attr,
&dev_attr_address.attr,
&dev_attr_state.attr,
+#ifdef CONFIG_NVME_MULTIPATH
+ &dev_attr_mpath_personality.attr,
+#endif
NULL
};
--
2.16.3
On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> Hi,
>
> This patch series aims to provide a more fine grained control over
> nvme's native multipathing, by allowing it to be switched on and off
> on a per-subsystem basis instead of a big global switch.
No. The only reason we even allowed to turn multipathing off is
because you complained about installer issues. The path forward
clearly is native multipathing and there will be no additional support
for the use cases of not using it.
On Fri, May 25 2018 at 8:53am -0400,
Johannes Thumshirn <[email protected]> wrote:
> Provide a way to disable NVMe native multipathing on a per-subsystem
> basis to enable a user to use dm-mpath and nvme native multipathing on
> the same host for different nvme devices.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
On Fri, May 25 2018 at 9:05am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> >
> > This patch series aims to provide a more fine grained control over
> > nvme's native multipathing, by allowing it to be switched on and off
> > on a per-subsystem basis instead of a big global switch.
>
> No. The only reason we even allowed to turn multipathing off is
> because you complained about installer issues. The path forward
> clearly is native multipathing and there will be no additional support
> for the use cases of not using it.
We all basically knew this would be your position. But at this year's
LSF we pretty quickly reached consensus that we do in fact need this.
Except for yourself, Sagi and afaik Martin George: all on the cc were in
attendance and agreed.
And since then we've exchanged mails to refine and test Johannes'
implementation.
You've isolated yourself on this issue. Please just accept that we all
have a pretty solid command of what is needed to properly provide
commercial support for NVMe multipath.
The ability to switch between "native" and "other" multipath absolutely
does _not_ imply anything about the winning disposition of native vs
other. It is purely about providing commercial flexibility to use
whatever solution makes sense for a given environment. The default _is_
native NVMe multipath. It is on userspace solutions for "other"
multipath (e.g. multipathd) to allow users to whitelist an NVMe
subsystem to be switched to "other".
Hopefully this clarifies things, thanks.
Mike
On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:
> We all basically knew this would be your position. But at this year's
> LSF we pretty quickly reached consensus that we do in fact need this.
> Except for yourself, Sagi and afaik Martin George: all on the cc were in
> attendance and agreed.
And I very much disagree, and you'd better come up with a good reason
to override me as the author and maintainer of this code.
> And since then we've exchanged mails to refine and test Johannes'
> implementation.
Since when was acting behind the scenes a good argument for anything?
> Hopefully this clarifies things, thanks.
It doesn't.
The whole point we have native multipath in nvme is because dm-multipath
is the wrong architecture (and has been, long predating you, nothing
personal). And I don't want to be stuck additional decades with this
in nvme. We allowed a global opt-in to ease the three people in the
world with existing setups to keep using that, but I also said I
won't go any step further. And I stand to that.
On Fri, May 25, 2018 at 03:05:35PM +0200, Christoph Hellwig wrote:
> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> >
> > This patch series aims to provide a more fine grained control over
> > nvme's native multipathing, by allowing it to be switched on and off
> > on a per-subsystem basis instead of a big global switch.
>
> No. The only reason we even allowed to turn multipathing off is
> because you complained about installer issues. The path forward
> clearly is native multipathing and there will be no additional support
> for the use cases of not using it.
First of all, it wasn't my idea and I'm just doing my job here, as I
got this task assigned at LSF and tried to do my best here.
Personally I _do_ agree with you and do not want to use dm-mpath in
nvme either (mainly because I don't really know the code and don't
want to learn yet another subsystem).
But Mike's and Hannes' arguments were reasonable as well, we do not
know if there are any existing setups we might break leading to
support calls, which we have to deal with. Personally I don't believe
there are lots of existing nvme multipath setups out there, but who
am I to judge.
So can we find a middle ground on this? Or we'll have the
all-or-nothing situation we have in scsi-mq now again. How about
tying the switch to a config option which is off by default?
Byte,
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
On Fri, May 25, 2018 at 04:22:17PM +0200, Johannes Thumshirn wrote:
> But Mike's and Hannes' arguments where reasonable as well, we do not
> know if there are any existing setups we might break leading to
> support calls, which we have to deal with. Personally I don't believe
> there are lot's of existing nvme multipath setups out there, but who
> am I to judge.
I don't think existing setups are very likely, but they absolutely
are a valid reason to support the legacy mode. That is why we support
the legacy mode using the multipath module option. Once you move
to a per-subsystem switch you don't support legacy setups, you
create a maze of new setups that we need to keep compatibility
support for forever.
> So can we find a middle ground to this? Or we'll have the
> all-or-nothing situation we have in scsi-mq now again. How about
> tieing the switch to a config option which is off per default?
The middle ground is the module option. It provides 100% backwards
compatibility if used, but more importantly doesn't create hairy
runtime ABIs that we will have to support forever.
On Fri, May 25 2018 at 10:12am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:
> > We all basically knew this would be your position. But at this year's
> > LSF we pretty quickly reached consensus that we do in fact need this.
> > Except for yourself, Sagi and afaik Martin George: all on the cc were in
> > attendance and agreed.
>
> And I very much disagree, and you'd better come up with a good reason
> to override me as the author and maintainer of this code.
I hope you don't truly think this is me vs you.
Some of the reasons are:
1) we need flexibility during the transition to native NVMe multipath
2) we need to support existing customers' dm-multipath storage networks
3) asking users to use an entirely new infrastructure that conflicts
with their dm-multipath expertise and established norms is a hard
sell. Especially for environments that have a mix of traditional
multipath (FC, iSCSI, whatever) and NVMe over fabrics.
4) Layered products (both vendor provided and user developed) have been
trained to fully support and monitor dm-multipath; they have no
understanding of native NVMe multipath
> > And since then we've exchanged mails to refine and test Johannes'
> > implementation.
>
> Since when was acting behind the scenes a good argument for anything?
I mentioned our continued private collaboration to establish that this
wasn't a momentary weakness by anyone at LSF. It has had a lot of soak
time in our heads.
We did it privately because we needed a concrete proposal that works for
our needs. Rather than getting shot down over some shortcoming in an
RFC-style submission.
> > Hopefully this clarifies things, thanks.
>
> It doesn't.
>
> The whole point we have native multipath in nvme is because dm-multipath
> is the wrong architecture (and has been, long predating you, nothing
> personal). And I don't want to be stuck additional decades with this
> in nvme. We allowed a global opt-in to ease the three people in the
> world with existing setups to keep using that, but I also said I
> won't go any step further. And I stand to that.
Thing is you really don't get to dictate that to the industry. Sorry.
Reality is this ability to switch "native" vs "other" gives us the
options I've been talking about absolutely needing since the start of
this NVMe multipathing debate.
Your fighting against it for so long has prevented progress on NVMe
multipath in general. Taking this change will increase native NVMe
multipath deployment. Otherwise we're just going to have to disable
native multipath entirely for the time being. That does users a
disservice because I completely agree that there _will_ be setups where
native NVMe multipath really does offer a huge win. But those setups
could easily be deployed on the same hosts as another variant of NVMe
that really does want the use of the legacy DM multipath stack (possibly
even just for reason 4 above).
Mike
Mike,
I understand and appreciate your position but I still don't think the
arguments for enabling DM multipath are sufficiently compelling. The
whole point of ANA is for things to be plug and play without any admin
intervention whatsoever.
I also think we're getting ahead of ourselves a bit. The assumption
seems to be that NVMe ANA devices are going to be broken--or that they
will require the same amount of tweaking as SCSI devices--and therefore
DM multipath support is inevitable. However, I'm not sure that will be
the case.
> Thing is you really don't get to dictate that to the industry. Sorry.
We are in the fortunate position of being able to influence how the spec
is written. It's a great opportunity to fix the mistakes of the past in
SCSI. And to encourage the industry to ship products that don't need the
current level of manual configuration and complex management.
So I am in favor of Johannes' patches *if* we get to the point where a
Plan B is needed. But I am not entirely convinced that's the case just
yet. Let's see some more ANA devices first. And once we do, we are also
in a position where we can put some pressure on the vendors to either
amend the specification or fix their implementations to work with ANA.
--
Martin K. Petersen Oracle Linux Engineering
On Mon, May 28 2018 at 9:19pm -0400,
Martin K. Petersen <[email protected]> wrote:
>
> Mike,
>
> I understand and appreciate your position but I still don't think the
> arguments for enabling DM multipath are sufficiently compelling. The
> whole point of ANA is for things to be plug and play without any admin
> intervention whatsoever.
>
> I also think we're getting ahead of ourselves a bit. The assumption
> seems to be that NVMe ANA devices are going to be broken--or that they
> will require the same amount of tweaking as SCSI devices--and therefore
> DM multipath support is inevitable. However, I'm not sure that will be
> the case.
>
> > Thing is you really don't get to dictate that to the industry. Sorry.
>
> We are in the fortunate position of being able to influence how the spec
> is written. It's a great opportunity to fix the mistakes of the past in
> SCSI. And to encourage the industry to ship products that don't need the
> current level of manual configuration and complex management.
>
> So I am in favor of Johannes' patches *if* we get to the point where a
> Plan B is needed. But I am not entirely convinced that's the case just
> yet. Let's see some more ANA devices first. And once we do, we are also
> in a position where we can put some pressure on the vendors to either
> amend the specification or fix their implementations to work with ANA.
ANA really isn't a motivating factor for whether or not to apply this
patch. So no, I don't have any interest in waiting to apply it.
You're somehow missing that your implied "Plan A" (native NVMe
multipath) has been pushed as the only way forward for NVMe multipath
despite it being unproven. Worse, literally no userspace infrastructure
exists to control native NVMe multipath (and this is supposed to be
comforting because the spec is tightly coupled to hch's implementation
that he controls with an iron fist).
We're supposed to be OK with completely _forced_ obsolescence of
dm-multipath infrastructure that has proven itself capable of managing a
wide range of complex multipath deployments for a tremendous number of
enterprise Linux customers (of multiple vendors)!? This is a tough sell
given the content of my previous paragraph (coupled with the fact the
next enterprise Linux versions are being hardened _now_).
No, what both Red Hat and SUSE are saying is: cool let's have a go at
"Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
multipath) to be conditionally enabled to coexist with native NVMe
multipath?
Nobody can explain why this patch is some sort of detriment. It
literally is an amazingly simple switch that provides flexibility we
_need_. hch had some non-specific concern about this patch forcing
support of some "ABI". Which ABI is that _exactly_?
Mike
On Mon, 28 May 2018 23:02:36 -0400
Mike Snitzer <[email protected]> wrote:
> On Mon, May 28 2018 at 9:19pm -0400,
> Martin K. Petersen <[email protected]> wrote:
>
> >
> > Mike,
> >
> > I understand and appreciate your position but I still don't think
> > the arguments for enabling DM multipath are sufficiently
> > compelling. The whole point of ANA is for things to be plug and
> > play without any admin intervention whatsoever.
> >
> > I also think we're getting ahead of ourselves a bit. The assumption
> > seems to be that NVMe ANA devices are going to be broken--or that
> > they will require the same amount of tweaking as SCSI devices--and
> > therefore DM multipath support is inevitable. However, I'm not sure
> > that will be the case.
> >
> > > Thing is you really don't get to dictate that to the industry.
> > > Sorry.
> >
> > We are in the fortunate position of being able to influence how the
> > spec is written. It's a great opportunity to fix the mistakes of
> > the past in SCSI. And to encourage the industry to ship products
> > that don't need the current level of manual configuration and
> > complex management.
> >
> > So I am in favor of Johannes' patches *if* we get to the point
> > where a Plan B is needed. But I am not entirely convinced that's
> > the case just yet. Let's see some more ANA devices first. And once
> > we do, we are also in a position where we can put some pressure on
> > the vendors to either amend the specification or fix their
> > implementations to work with ANA.
>
> ANA really isn't a motivating factor for whether or not to apply this
> patch. So no, I don't have any interest in waiting to apply it.
>
Correct. That patch is _not_ to work around any perceived incompatibility
on the OS side.
The patch is primarily to give _admins_ a choice.
Some installations like hosting providers etc are running quite complex
scenarios, most of which are highly automated.
So for those there is a real benefit to be able to use dm-multipathing
for NVMe; they are totally fine with having a performance impact if
they can avoid rewriting their infrastructure.
Cheers,
Hannes
On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:
> No, what both Red Hat and SUSE are saying is: cool let's have a go at
> "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
> multipath) to be conditionally enabled to coexist with native NVMe
> multipath?
For a "Plan B" we can still use the global knob that's already in
place (even if this reminds me so much about scsi-mq which at least we
haven't turned on in fear of performance regressions).
Let's drop the discussion here, I don't think it leads to something
else than flamewars.
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> For a "Plan B" we can still use the global knob that's already in
> place (even if this reminds me so much about scsi-mq which at least we
> haven't turned on in fear of performance regressions).
>
> Let's drop the discussion here, I don't think it leads to something
> else than flamewars.
If our plan A doesn't work we can go back to these patches. For now
I'd rather have everyone spend their time on making Plan A work than
preparing for contingencies. Nothing prevents anyone from using these
patches already out there if they really want to, but I'd recommend
people are very careful about doing so as you'll lock yourself into
a long-term maintenance burden.
On Tue, May 29 2018 at 4:09am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> > For a "Plan B" we can still use the global knob that's already in
> > place (even if this reminds me so much about scsi-mq which at least we
> > haven't turned on in fear of performance regressions).
> >
> > Let's drop the discussion here, I don't think it leads to something
> > else than flamewars.
As the author of the original patch you're fine to want to step away from
this needlessly ugly aspect. But it doesn't change the fact that we
need answers on _why_ it is a genuinely detrimental change. (hint: we
know it isn't).
The enterprise Linux people who directly need to support multipath want
the flexibility to allow dm-multipath while simultaneously allowing
native NVMe multipathing on the same host.
Hannes Reinecke and others, if you want the flexibility this patchset
offers please provide your review/acks.
> If our plan A doesn't work we can go back to these patches. For now
> I'd rather have everyone spend their time on making Plan A work then
> preparing for contingencies. Nothing prevents anyone from using these
> patches already out there if they really want to, but I'd recommend
> people are very careful about doing so as you'll lock yourself into
> a long-term maintainance burden.
This isn't about contingencies. It is about continuing compatibility
with a sophisticated dm-multipath stack that is widely used by, and
familiar to, so many.
Christoph, you know you're being completely vague right? You're
actively denying the validity of our position (at least Hannes and I)
with handwaving and effectively FUD, e.g. "maze of new setups" and
"hairy runtime ABIs" here: https://lkml.org/lkml/2018/5/25/461
To restate my question, from https://lkml.org/lkml/2018/5/28/2179:
hch had some non-specific concern about this patch forcing
support of some "ABI". Which ABI is that _exactly_?
The incremental effort required to support NVMe in dm-multipath isn't so
grim. And those who will do that work are signing up for it -- while
still motivated to help make native NVMe multipath a success.
Can you please give us time to responsibly wean users off dm-multipath?
Mike
On Tue, May 29 2018 at 4:09am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> > For a "Plan B" we can still use the global knob that's already in
> > place (even if this reminds me so much about scsi-mq which at least we
> > haven't turned on in fear of performance regressions).
> >
> > Let's drop the discussion here, I don't think it leads to something
> > else than flamewars.
>
> If our plan A doesn't work we can go back to these patches. For now
> I'd rather have everyone spend their time on making Plan A work then
> preparing for contingencies. Nothing prevents anyone from using these
> patches already out there if they really want to, but I'd recommend
> people are very careful about doing so as you'll lock yourself into
> a long-term maintainance burden.
Restating (for others): this patchset really isn't about contingencies.
It is about choice.
Since we're at an impasse, in the hopes of soliciting definitive
feedback from Jens and Linus, I'm going to attempt to reset the
discussion for their entry.
In summary, we have a classic example of a maintainer stalemate here:
1) Christoph, as NVMe co-maintainer, doesn't want to allow native NVMe
multipath to actively coexist with dm-multipath's NVMe support on the
same host.
2) I, as DM maintainer, would like to offer this flexibility to users --
by giving them opt-in choice to continue using existing dm-multipath
with NVMe. (also, both Red Hat and SUSE would like to offer this).
There is no technical reason why they cannot coexist. Hence this simple
patchset that was originally offered by Johannes Thumshirn with
contributions from myself.
With those basics established, I'd like to ask:
Are we, as upstream kernel maintainers, really willing to force a
needlessly all-or-nothing multipath infrastructure decision on Linux
NVMe users with dm-multipath expertise? Or should we also give them an
opt-in choice to continue using the familiar, mature, dm-multipath
option -- in addition to the new default native NVMe multipath that may,
in the long-term, be easier to use and more performant?
A definitive answer to this would be very helpful.
As you can see above, Christoph is refusing to allow the opt-in option.
This will force enterprise Linux distributions to consider carrying the
patches on our own, in order to meet existing customer needs. The
maintenance burden of this is unnecessary, and it goes against our
"upstream first" mantra.
I'm well past the point of wanting to reach closure on this issue. But
I do feel strongly enough about it that I'd be remiss not to solicit
feedback that lets us have no doubt about what the future holds for
upstream Linux's NVMe multipathing.
Please advise, thanks.
Mike
--
Additional background (for the benefit of others who haven't been
following along):
Jens, as block maintainer, took Christoph's NVMe change to have NVMe
internalize multiple paths to an NVMe subsystem. This is referred to as
"native NVMe multipath" (see: commit 32acab3181). As DM maintainer,
I've consistently requested we have the ability to allow users to opt-in
to exposing the underlying NVMe devices so that dm-multipath could
continue to provide a single interface for multipath configuration and
monitoring, see:
http://lists.infradead.org/pipermail/linux-nvme/2017-February/008256.html
Christoph rejected this on the principle that he dislikes the
dm-multipath architecture (being split between DM in kernel,
dm-mpath.ko, and userspace via multipathd; exposure of underlying
devices, less performant, etc). So instead, dm-multipath was prevented
from seeing these individual paths because the NVMe driver hid them.
That is, unless CONFIG_NVME_MULTIPATH is disabled or nvme_core.multipath=N
is set. And if either is used, users then cannot make use of native
NVMe multipath. So currently, native NVMe multipath vs dm-multipath is
all-or-nothing.
The feeling is we should afford users the ability to continue using
dm-multipath at their choosing. Hannes summarized the need for this
nicely here: https://lkml.org/lkml/2018/5/29/95
Please also note this from my first reply to this thread (here:
https://lkml.org/lkml/2018/5/25/438):
The ability to switch between "native" and "other" multipath absolutely
does _not_ imply anything about the winning disposition of native vs
other. It is purely about providing commercial flexibility to use
whatever solution makes sense for a given environment. The default _is_
native NVMe multipath. It is on userspace solutions for "other"
multipath (e.g. multipathd) to allow users to whitelist an NVMe
subsystem to be switched to "other".
On 5/29/18 5:27 PM, Mike Snitzer wrote:
> On Tue, May 29 2018 at 4:09am -0400,
> Christoph Hellwig <[email protected]> wrote:
>
>> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
>>> For a "Plan B" we can still use the global knob that's already in
>>> place (even if this reminds me so much about scsi-mq which at least we
>>> haven't turned on in fear of performance regressions).
>>>
>>> Let's drop the discussion here, I don't think it leads to something
>>> else than flamewars.
>>
>> If our plan A doesn't work we can go back to these patches. For now
>> I'd rather have everyone spend their time on making Plan A work then
>> preparing for contingencies. Nothing prevents anyone from using these
>> patches already out there if they really want to, but I'd recommend
>> people are very careful about doing so as you'll lock yourself into
>> a long-term maintainance burden.
>
> Restating (for others): this patchset really isn't about contingencies.
> It is about choice.
>
> Since we're at an impasse, in the hopes of soliciting definitive
> feedback from Jens and Linus, I'm going to attempt to reset the
> discussion for their entry.
>
> In summary, we have a classic example of a maintainer stalemate here:
> 1) Christoph, as NVMe co-maintainer, doesn't want to allow native NVMe
> multipath to actively coexist with dm-multipath's NVMe support on the
> same host.
> 2) I, as DM maintainer, would like to offer this flexibility to users --
> by giving them opt-in choice to continue using existing dm-multipath
> with NVMe. (also, both Red Hat and SUSE would like to offer this).
>
> There is no technical reason why they cannot coexist. Hence this simple
> patchset that was originally offered by Johannes Thumshirn with
> contributions from myself.
Here's what I think - flag days tend to suck. They may be more convenient
for developers, but they inflict pain on users. Sometimes they prevent
them from moving forward, since updates are now gated on external
dependencies. Moving forward with a new architecture is great, but
proper care has to be given to existing users of multipath, regardless
of how few they may be.
This patchset seems pretty clean and minimalist. Realistically, I'm
guessing that SUSE and RH will ship it regardless of upstream status.
--
Jens Axboe
On Wed, May 30 2018 at 3:05pm -0400,
Jens Axboe <[email protected]> wrote:
> On 5/29/18 5:27 PM, Mike Snitzer wrote:
> > On Tue, May 29 2018 at 4:09am -0400,
> > Christoph Hellwig <[email protected]> wrote:
> >
> >> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> >>> For a "Plan B" we can still use the global knob that's already in
> >>> place (even if this reminds me so much about scsi-mq which at least we
> >>> haven't turned on in fear of performance regressions).
> >>>
> >>> Let's drop the discussion here, I don't think it leads to something
> >>> else than flamewars.
> >>
> >> If our plan A doesn't work we can go back to these patches. For now
> >> I'd rather have everyone spend their time on making Plan A work then
> >> preparing for contingencies. Nothing prevents anyone from using these
> >> patches already out there if they really want to, but I'd recommend
> >> people are very careful about doing so as you'll lock yourself into
> >> a long-term maintainance burden.
> >
> > Restating (for others): this patchset really isn't about contingencies.
> > It is about choice.
> >
> > Since we're at an impasse, in the hopes of soliciting definitive
> > feedback from Jens and Linus, I'm going to attempt to reset the
> > discussion for their entry.
> >
> > In summary, we have a classic example of a maintainer stalemate here:
> > 1) Christoph, as NVMe co-maintainer, doesn't want to allow native NVMe
> > multipath to actively coexist with dm-multipath's NVMe support on the
> > same host.
> > 2) I, as DM maintainer, would like to offer this flexibility to users --
> > by giving them opt-in choice to continue using existing dm-multipath
> > with NVMe. (also, both Red Hat and SUSE would like to offer this).
> >
> > There is no technical reason why they cannot coexist. Hence this simple
> > patchset that was originally offered by Johannes Thumshirn with
> > contributions from myself.
>
> Here's what I think - flag days tend to suck. They may be more convenient
> for developers, but they inflict pain on users. Sometimes they prevent
> them from moving forward, since updates are now gated on external
> dependencies. Moving forward with a new architecture is great, but
> proper care has to be given to existing users of multipath, regardless
> of how few they may be.
As I mentioned at the end of my summary (and in my first reply to this
0th header): it is on the "other" multipath toolchain to deal with
switching from "native" to "other" (by writing to 'mpath_personality').
For dm-multipath, it is multipathd that would trigger the switch from
"native" to "other" iff user opts-in by configuring multipath.conf to
own a particular NVMe subsystem's multipathing.
So users very likely won't ever have a need to write to
'mpath_personality'. And as such, they'll just default to using
"native" NVMe multipath.
> This patchset seems pretty clean and minimalist. Realistically, I'm
> guessing that SUSE and RH will ship it regardless of upstream status.
TBD really. We are keen to enable CONFIG_NVME_MULTIPATH to allow users
to use native (as the default!). But we won't be able to do that unless
we have this patchset. Because we really do _need_ to give our users
the option of continuing to use dm-multipath.
And as the Red Hat employee who would have to port it to each kernel (as
long as there is a need): _please_ don't make me do that.
PLEASE! ;)
Mike
Hi Folks,
I'm sorry to chime in super late on this, but a lot has been
going on for me lately which got me off the grid.
So I'll try to provide my input hopefully without starting any more
flames..
>>> This patch series aims to provide a more fine grained control over
>>> nvme's native multipathing, by allowing it to be switched on and off
>>> on a per-subsystem basis instead of a big global switch.
>>
>> No. The only reason we even allowed to turn multipathing off is
>> because you complained about installer issues. The path forward
>> clearly is native multipathing and there will be no additional support
>> for the use cases of not using it.
>
> We all basically knew this would be your position. But at this year's
> LSF we pretty quickly reached consensus that we do in fact need this.
> Except for yourself, Sagi and afaik Martin George: all on the cc were in
> attendance and agreed.
Correction, I wasn't able to attend LSF this year (unfortunately).
> And since then we've exchanged mails to refine and test Johannes'
> implementation.
>
> You've isolated yourself on this issue. Please just accept that we all
> have a pretty solid command of what is needed to properly provide
> commercial support for NVMe multipath.
>
> The ability to switch between "native" and "other" multipath absolutely
> does _not_ imply anything about the winning disposition of native vs
> other. It is purely about providing commercial flexibility to use
> whatever solution makes sense for a given environment. The default _is_
> native NVMe multipath. It is on userspace solutions for "other"
> multipath (e.g. multipathd) to allow user's to whitelist an NVMe
> subsystem to be switched to "other".
>
> Hopefully this clarifies things, thanks.
Mike, I understand what you're saying, but I also agree with hch on
the simple fact that this is a burden on linux nvme (although less
passionate about it than hch).
Beyond that, this is going to get much worse when we support "dispersed
namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed
namespaces" makes NVMe namespaces share-able over different subsystems
so changing the personality on a per-subsystem basis is just asking for
trouble.
Moreover, I also wanted to point out that fabrics array vendors are
building products that rely on standard nvme multipathing (and probably
multipathing over dispersed namespaces as well), and keeping a knob that
will keep nvme users with dm-multipath will probably not help them
educate their customers as well... So there is another angle to this.
Don't get me wrong, I do support your cause, and I think nvme should try
to help, I just think that subsystem granularity is not the correct
approach going forward.
As I said, I've been off the grid, can you remind me why global knob is
not sufficient?
This might sound stupid to you, but can't users that desperately must
keep using dm-multipath (for its mature toolset or what-not) just stack
it on multipath nvme device? (I might be completely off on this so
feel free to correct my ignorance).
On Wed, May 30 2018 at 5:20pm -0400,
Sagi Grimberg <[email protected]> wrote:
> Hi Folks,
>
> I'm sorry to chime in super late on this, but a lot has been
> going on for me lately which got me off the grid.
>
> So I'll try to provide my input hopefully without starting any more
> flames..
>
> >>>This patch series aims to provide a more fine grained control over
> >>>nvme's native multipathing, by allowing it to be switched on and off
> >>>on a per-subsystem basis instead of a big global switch.
> >>
> >>No. The only reason we even allowed to turn multipathing off is
> >>because you complained about installer issues. The path forward
> >>clearly is native multipathing and there will be no additional support
> >>for the use cases of not using it.
> >
> >We all basically knew this would be your position. But at this year's
> >LSF we pretty quickly reached consensus that we do in fact need this.
> >Except for yourself, Sagi and afaik Martin George: all on the cc were in
> >attendance and agreed.
>
> Correction, I wasn't able to attend LSF this year (unfortunately).
Yes, I was trying to say you weren't at LSF (but are on the cc).
> >And since then we've exchanged mails to refine and test Johannes'
> >implementation.
> >
> >You've isolated yourself on this issue. Please just accept that we all
> >have a pretty solid command of what is needed to properly provide
> >commercial support for NVMe multipath.
> >
> >The ability to switch between "native" and "other" multipath absolutely
> >does _not_ imply anything about the winning disposition of native vs
> >other. It is purely about providing commercial flexibility to use
> >whatever solution makes sense for a given environment. The default _is_
> >native NVMe multipath. It is on userspace solutions for "other"
> >multipath (e.g. multipathd) to allow user's to whitelist an NVMe
> >subsystem to be switched to "other".
> >
> >Hopefully this clarifies things, thanks.
>
> Mike, I understand what you're saying, but I also agree with hch on
> the simple fact that this is a burden on linux nvme (although less
> passionate about it than hch).
>
> Beyond that, this is going to get much worse when we support "dispersed
> namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed
> namespaces" makes NVMe namespaces share-able over different subsystems
> so changing the personality on a per-subsystem basis is just asking for
> trouble.
>
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.
Wouldn't expect you guys to nurture this 'mpath_personality' knob. So
when features like "dispersed namespaces" land, a negative check would
need to be added in the code to prevent switching from "native".
And once something like "dispersed namespaces" lands we'd then have to
see about a more sophisticated switch that operates at a different
granularity. Could also be that switching one subsystem that is part of
"dispersed namespaces" would then cascade to all other associated
subsystems? Not that dissimilar from the 3rd patch in this series that
allows a 'device' switch to be done in terms of the subsystem.
Anyway, I don't know the end from the beginning on something you just
told me about ;) But we're all in this together. And can take it as it
comes. I'm merely trying to bridge the gap from old dm-multipath while
native NVMe multipath gets its legs.
In time I really do have aspirations to contribute more to NVMe
multipathing. I think Christoph's NVMe multipath implementation of
a bio-based device on top of NVMe core's blk-mq device(s) is very clever
and effective (blk_steal_bios() hack and all).
> Don't get me wrong, I do support your cause, and I think nvme should try
> to help, I just think that subsystem granularity is not the correct
> approach going forward.
I understand there will be limits to this 'mpath_personality' knob's
utility and it'll need to evolve over time. But the burden of making
more advanced NVMe multipath features accessible outside of native NVMe
isn't intended to be on any of the NVMe maintainers (other than maybe
remembering to disallow the switch where it makes sense in the future).
> As I said, I've been off the grid, can you remind me why global knob is
> not sufficient?
Because once nvme_core.multipath=N is set, native NVMe multipath is then
not accessible from the same host. The goal of this patchset is to give
users choice. But not limit them to _only_ using dm-multipath if they
just have some legacy needs.
Tough to be convincing with hypotheticals but I could imagine a very
obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
"fabrics" (especially if/when the numa-based path selector lands). But
the same host with PCI NVMe could be connected to a FC network that has
historically always been managed via dm-multipath.. but say that
FC-based infrastructure gets updated to use NVMe (to leverage a wider
NVMe investment, whatever?) -- but maybe admins would still prefer to
use dm-multipath for the NVMe over FC.
> This might sound stupid to you, but can't users that desperately must
> keep using dm-multipath (for its mature toolset or what-not) just
> stack it on multipath nvme device? (I might be completely off on
> this so feel free to correct my ignorance).
We could certainly pursue adding multipath-tools support for native NVMe
multipathing. Not opposed to it (even if just reporting topology and
state). But given the extensive lengths NVMe multipath goes to hide
devices, we'd need some way of piercing through the opaque nvme device
that native NVMe multipath exposes. But that really is a tangent
relative to this patchset. Since that kind of visibility would also
benefit the nvme cli... otherwise how are users to even be able to trust
but verify native NVMe multipathing did what they expected it to?
Mike
On Wed, May 30 2018 at 5:20pm -0400,
Sagi Grimberg <[email protected]> wrote:
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.
Noticed I didn't respond directly to this aspect. As I explained in
various replies to this thread: The users/admins would be the ones who
would decide to use dm-multipath. It wouldn't be something that'd be
imposed by default. If anything, the all-or-nothing
nvme_core.multipath=N would pose a much more serious concern for these
array vendors that do have designs to specifically leverage native NVMe
multipath. Because if users were to get into the habit of setting that
on the kernel commandline they'd literally _never_ be able to leverage
native NVMe multipathing.
We can also add multipath.conf docs (man page, etc) that caution admins
to consult their array vendors about whether using dm-multipath is to be
avoided, etc.
Again, this is opt-in, so on an upstream Linux kernel level the default
of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
is configured). Not seeing why there is so much angst and concern about
offering this flexibility via opt-in but I'm also glad we're having this
discussion to have our eyes wide open.
Mike
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:
> > No, what both Red Hat and SUSE are saying is: cool let's have a go at
> > "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
> > multipath) to be conditionally enabled to coexist with native NVMe
> > multipath?
>
> For a "Plan B" we can still use the global knob that's already in
> place (even if this reminds me so much about scsi-mq which at least we
> haven't turned on in fear of performance regressions).
BTW, for scsi-mq, we have made a little progress with commit 2f31115e940c
(scsi: core: introduce force_blk_mq), and virtio-scsi now always works
in scsi-mq mode. Each driver can then decide if .force_blk_mq needs
to be set.
Hope progress can be made on this nvme mpath issue too.
Thanks,
Ming
> @@ -246,3 +246,27 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
> blk_cleanup_queue(head->disk->queue);
> put_disk(head->disk);
> }
> +
> +int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
> +{
> + struct nvme_ctrl *ctrl;
> + int ret = 0;
> +
> +restart:
> + mutex_lock(&subsys->lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + if (!list_empty(&ctrl->namespaces)) {
> + mutex_unlock(&subsys->lock);
> + nvme_remove_namespaces(ctrl);
This looks completely broken. Any of these namespaces can have an
active handle on it.
> Wouldn't expect you guys to nurture this 'mpath_personality' knob. SO
> when features like "dispersed namespaces" land a negative check would
> need to be added in the code to prevent switching from "native".
>
> And once something like "dispersed namespaces" lands we'd then have to
> see about a more sophisticated switch that operates at a different
> granularity. Could also be that switching one subsystem that is part of
> "dispersed namespaces" would then cascade to all other associated
> subsystems? Not that dissimilar from the 3rd patch in this series that
> allows a 'device' switch to be done in terms of the subsystem.
Which I think is broken by allowing this personality to be changed on
the fly.
>
> Anyway, I don't know the end from the beginning on something you just
> told me about ;) But we're all in this together. And can take it as it
> comes.
I agree but this will be exposed to user-space and we will need to live
with it for a long long time...
> I'm merely trying to bridge the gap from old dm-multipath while
> native NVMe multipath gets its legs.
>
> In time I really do have aspirations to contribute more to NVMe
> multipathing. I think Christoph's NVMe multipath implementation of
> bio-based device ontop on NVMe core's blk-mq device(s) is very clever
> and effective (blk_steal_bios() hack and all).
That's great.
>> Don't get me wrong, I do support your cause, and I think nvme should try
>> to help, I just think that subsystem granularity is not the correct
>> approach going forward.
>
> I understand there will be limits to this 'mpath_personality' knob's
> utility and it'll need to evolve over time. But the burden of making
> more advanced NVMe multipath features accessible outside of native NVMe
> isn't intended to be on any of the NVMe maintainers (other than maybe
> remembering to disallow the switch where it makes sense in the future).
I would expect that any "advanced multipath features" would be properly
brought up with the NVMe TWG as a ratified standard and find its way
to nvme. So I don't think this particularly is a valid argument.
>> As I said, I've been off the grid, can you remind me why global knob is
>> not sufficient?
>
> Because once nvme_core.multipath=N is set: native NVMe multipath is then
> not accessible from the same host. The goal of this patchset is to give
> users choice. But not limit them to _only_ using dm-multipath if they
> just have some legacy needs.
>
> Tough to be convincing with hypotheticals but I could imagine a very
> obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> "fabrics" (especially if/when the numa-based path selector lands). But
> the same host with PCI NVMe could be connected to a FC network that has
> historically always been managed via dm-multipath.. but say that
> FC-based infrastructure gets updated to use NVMe (to leverage a wider
> NVMe investment, whatever?) -- but maybe admins would still prefer to
> use dm-multipath for the NVMe over FC.
You are referring to an array exposing media via nvmf and scsi
simultaneously? I'm not sure that there is a clean definition of
how that is supposed to work (ANA/ALUA, reservations, etc..)
>> This might sound stupid to you, but can't users that desperately must
>> keep using dm-multipath (for its mature toolset or what-not) just
>> stack it on multipath nvme device? (I might be completely off on
>> this so feel free to correct my ignorance).
>
> We could certainly pursue adding multipath-tools support for native NVMe
> multipathing. Not opposed to it (even if just reporting topology and
> state). But given the extensive lengths NVMe multipath goes to hide
> devices we'd need some way to piercing through the opaque nvme device
> that native NVMe multipath exposes. But that really is a tangent
> relative to this patchset. Since that kind of visibility would also
> benefit the nvme cli... otherwise how are users to even be able to trust
> but verify native NVMe multipathing did what it expected it to?
Can you explain what is missing for multipath-tools to resolve topology?
nvme list-subsys is doing just that, doesn't it? It lists subsys-ctrl
topology but that is sort of the important information as controllers
are the real paths.
>> Moreover, I also wanted to point out that fabrics array vendors are
>> building products that rely on standard nvme multipathing (and probably
>> multipathing over dispersed namespaces as well), and keeping a knob that
>> will keep nvme users with dm-multipath will probably not help them
>> educate their customers as well... So there is another angle to this.
>
> Noticed I didn't respond directly to this aspect. As I explained in
> various replies to this thread: The users/admins would be the ones who
> would decide to use dm-multipath. It wouldn't be something that'd be
> imposed by default. If anything, the all-or-nothing
> nvme_core.multipath=N would pose a much more serious concern for these
> array vendors that do have designs to specifically leverage native NVMe
> multipath. Because if users were to get into the habit of setting that
> on the kernel commandline they'd literally _never_ be able to leverage
> native NVMe multipathing.
>
> We can also add multipath.conf docs (man page, etc) that caution admins
> to consult their array vendors about whether using dm-multipath is to be
> avoided, etc.
>
> Again, this is opt-in, so on an upstream Linux kernel level the default
> of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
> is configured). Not seeing why there is so much angst and concern about
> offering this flexibility via opt-in but I'm also glad we're having this
> discussion to have our eyes wide open.
I think that the concern is valid and should not be dismissed. And
at times flexibility is a real source of pain, both to users and
developers.
The choice is there, no one is forbidden to use multipath. I'm just
still not sure exactly why the subsystem granularity is an absolute
must other than a volume exposed as a nvmf namespace and scsi lun (how
would dm-multipath detect this is the same device btw?)
On Thu, May 31 2018 at 4:37am -0400,
Sagi Grimberg <[email protected]> wrote:
>
> >Wouldn't expect you guys to nurture this 'mpath_personality' knob. SO
> >when features like "dispersed namespaces" land a negative check would
> >need to be added in the code to prevent switching from "native".
> >
> >And once something like "dispersed namespaces" lands we'd then have to
> >see about a more sophisticated switch that operates at a different
> >granularity. Could also be that switching one subsystem that is part of
> >"dispersed namespaces" would then cascade to all other associated
> >subsystems? Not that dissimilar from the 3rd patch in this series that
> >allows a 'device' switch to be done in terms of the subsystem.
>
> Which I think is broken by allowing to change this personality on the
> fly.
I saw your reply to the 1/3 patch.. I do agree it is broken for not
checking if any handles are active. But that is easily fixed no?
Or are you suggesting some other aspect of "broken"?
> >Anyway, I don't know the end from the beginning on something you just
> >told me about ;) But we're all in this together. And can take it as it
> >comes.
>
> I agree but this will be exposed to user-space and we will need to live
> with it for a long long time...
OK, well dm-multipath has been around for a long long time. We cannot
simply wish it away. Regardless of whatever architectural grievances
are levied against it.
There are far more customer and vendor products that have been developed
to understand and consume dm-multipath and multipath-tools interfaces
than native NVMe multipath.
> >>Don't get me wrong, I do support your cause, and I think nvme should try
> >>to help, I just think that subsystem granularity is not the correct
> >>approach going forward.
> >
> >I understand there will be limits to this 'mpath_personality' knob's
> >utility and it'll need to evolve over time. But the burden of making
> >more advanced NVMe multipath features accessible outside of native NVMe
> >isn't intended to be on any of the NVMe maintainers (other than maybe
> >remembering to disallow the switch where it makes sense in the future).
>
> I would expect that any "advanced multipath features" would be properly
> brought up with the NVMe TWG as a ratified standard and find its way
> to nvme. So I don't think this particularly is a valid argument.
You're misreading me again. I'm also saying stop worrying. I'm saying
any future native NVMe multipath features that come about don't necessarily
get immediate dm-multipath parity. The native NVMe multipath would need
appropriate negative checks.
> >>As I said, I've been off the grid, can you remind me why global knob is
> >>not sufficient?
> >
> >Because once nvme_core.multipath=N is set: native NVMe multipath is then
> >not accessible from the same host. The goal of this patchset is to give
> >users choice. But not limit them to _only_ using dm-multipath if they
> >just have some legacy needs.
> >
> >Tough to be convincing with hypotheticals but I could imagine a very
> >obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> >"fabrics" (especially if/when the numa-based path selector lands). But
> >the same host with PCI NVMe could be connected to a FC network that has
> >historically always been managed via dm-multipath.. but say that
> >FC-based infrastructure gets updated to use NVMe (to leverage a wider
> >NVMe investment, whatever?) -- but maybe admins would still prefer to
> >use dm-multipath for the NVMe over FC.
>
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)
No I'm referring to completely disjoint arrays that are homed to the
same host.
> >>This might sound stupid to you, but can't users that desperately must
> >>keep using dm-multipath (for its mature toolset or what-not) just
> >>stack it on multipath nvme device? (I might be completely off on
> >>this so feel free to correct my ignorance).
> >
> >We could certainly pursue adding multipath-tools support for native NVMe
> >multipathing. Not opposed to it (even if just reporting topology and
> >state). But given the extensive lengths NVMe multipath goes to hide
> >devices we'd need some way to piercing through the opaque nvme device
> >that native NVMe multipath exposes. But that really is a tangent
> >relative to this patchset. Since that kind of visibility would also
> >benefit the nvme cli... otherwise how are users to even be able to trust
> >but verify native NVMe multipathing did what it expected it to?
>
> Can you explain what is missing for multipath-tools to resolve topology?
I've not pored over these nvme interfaces (below I just learned
nvme-cli has since grown the capability), so I'm not informed enough
to know if nvme-cli has grown other new capabilities.
In any case, training multipath-tools to understand native NVMe
multipath topology doesn't replace actual dm-multipath interface and
associated information.
Per-device statistics is something that users want to be able to see.
Per-device up/down state, etc.
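Just to make concrete what "per-device statistics" means in the
dm-multipath world: every path is an ordinary, visible block device, so
its counters can be read straight from sysfs. A rough userspace sketch
(not part of this series; the default device name is only an
assumption, pass whichever path device you want to inspect):

/*
 * Read the I/O counters of one block device from /sys/block/<dev>/stat.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";	/* assumed default */
	char path[256];
	unsigned long long rd_ios, rd_sec, wr_ios, wr_sec;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* fields 1, 3, 5 and 7: reads, sectors read, writes, sectors written */
	if (fscanf(f, "%llu %*u %llu %*u %llu %*u %llu",
		   &rd_ios, &rd_sec, &wr_ios, &wr_sec) != 4) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("%s: %llu reads (%llu sectors), %llu writes (%llu sectors)\n",
	       dev, rd_ios, rd_sec, wr_ios, wr_sec);
	return 0;
}

dm-multipath gets this for free because it consumes visible block
devices; that per-path visibility is exactly the gap being argued about
here.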
> nvme list-subsys is doing just that, doesn't it? It lists subsys-ctrl
> topology but that is sort of the important information as controllers
> are the real paths.
I had nvme-cli version 1.4, which doesn't have nvme list-subsys.
Which means I need to uninstall the distro-provided
nvme-cli-1.4-3.el7.x86_64, find the relevant upstream and build from
src...
Yes, this looks like the basic topology info I was hoping for:
# nvme list-subsys
nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:80868086PHMB7361004R280CGN INTEL SSDPED1D280GA
\
+- nvme0 pcie 0000:5e:00.0
nvme-subsys1 - NQN=mptestnqn
\
+- nvme1 fc traddr=nn-0x200140111111dbcc:pn-0x100140111111dbcc host_traddr=nn-0x200140111111dac8:pn-0x100140111111dac8
+- nvme2 fc traddr=nn-0x200140111111dbcd:pn-0x100140111111dbcd host_traddr=nn-0x200140111111dac9:pn-0x100140111111dac9
+- nvme3 fc traddr=nn-0x200140111111dbce:pn-0x100140111111dbce host_traddr=nn-0x200140111111daca:pn-0x100140111111daca
+- nvme4 fc traddr=nn-0x200140111111dbcf:pn-0x100140111111dbcf host_traddr=nn-0x200140111111dacb:pn-0x100140111111dacb
On Thu, May 31 2018 at 4:51am -0400,
Sagi Grimberg <[email protected]> wrote:
>
> >>Moreover, I also wanted to point out that fabrics array vendors are
> >>building products that rely on standard nvme multipathing (and probably
> >>multipathing over dispersed namespaces as well), and keeping a knob that
> >>will keep nvme users with dm-multipath will probably not help them
> >>educate their customers as well... So there is another angle to this.
> >
> >Noticed I didn't respond directly to this aspect. As I explained in
> >various replies to this thread: The users/admins would be the ones who
> >would decide to use dm-multipath. It wouldn't be something that'd be
> >imposed by default. If anything, the all-or-nothing
> >nvme_core.multipath=N would pose a much more serious concern for these
> >array vendors that do have designs to specifically leverage native NVMe
> >multipath. Because if users were to get into the habit of setting that
> >on the kernel commandline they'd literally _never_ be able to leverage
> >native NVMe multipathing.
> >
> >We can also add multipath.conf docs (man page, etc) that caution admins
> >to consult their array vendors about whether using dm-multipath is to be
> >avoided, etc.
> >
> >Again, this is opt-in, so on an upstream Linux kernel level the default
> >of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
> >is configured). Not seeing why there is so much angst and concern about
> >offering this flexibility via opt-in but I'm also glad we're having this
> >discussion to have our eyes wide open.
>
> I think that the concern is valid and should not be dismissed. And
> at times flexibility is a real source of pain, both to users and
> developers.
>
> The choice is there, no one is forbidden to use multipath. I'm just
> still not sure exactly why the subsystem granularity is an absolute
> must other than a volume exposed as a nvmf namespace and scsi lun (how
> would dm-multipath detect this is the same device btw?)
Please see my other reply, I was talking about completely disjoint
arrays in my hypothetical config where having the ability to allow
simultaneous use of native NVMe multipath and dm-multipath is
meaningful.
Mike
On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:
> Because once nvme_core.multipath=N is set: native NVMe multipath is then
> not accessible from the same host. The goal of this patchset is to give
> users choice. But not limit them to _only_ using dm-multipath if they
> just have some legacy needs.
Choice by itself really isn't an argument. We need a really good
use case for all the complexity, and so far none has been presented.
> Tough to be convincing with hypotheticals but I could imagine a very
> obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> "fabrics" (especially if/when the numa-based path selector lands). But
> the same host with PCI NVMe could be connected to a FC network that has
> historically always been managed via dm-multipath.. but say that
> FC-based infrastructure gets updated to use NVMe (to leverage a wider
> NVMe investment, whatever?) -- but maybe admins would still prefer to
> use dm-multipath for the NVMe over FC.
That is a lot of maybes. If they prefer the good old way on FC they
can easily stay with SCSI, or for that matter use the global off
switch.
> > This might sound stupid to you, but can't users that desperately must
> > keep using dm-multipath (for its mature toolset or what-not) just
> > stack it on multipath nvme device? (I might be completely off on
> > this so feel free to correct my ignorance).
>
> We could certainly pursue adding multipath-tools support for native NVMe
> multipathing. Not opposed to it (even if just reporting topology and
> state). But given the extensive lengths NVMe multipath goes to hide
> devices we'd need some way to piercing through the opaque nvme device
> that native NVMe multipath exposes. But that really is a tangent
> relative to this patchset. Since that kind of visibility would also
> benefit the nvme cli... otherwise how are users to even be able to trust
> but verify native NVMe multipathing did what it expected it to?
Just look at the nvme-cli output or sysfs. It's all been there since
the code was merged to mainline.
On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:
> I saw your reply to the 1/3 patch.. I do agree it is broken for not
> checking if any handles are active. But that is easily fixed no?
Doing a switch at runtime simply is a really bad idea. If for some
reason we end up with a good per-controller switch it would have
to be something set at probe time, and to get it on a controller
you'd need to reset it first.
On Thu, May 31, 2018 at 11:37:20AM +0300, Sagi Grimberg wrote:
>> the same host with PCI NVMe could be connected to a FC network that has
>> historically always been managed via dm-multipath.. but say that
>> FC-based infrastructure gets updated to use NVMe (to leverage a wider
>> NVMe investment, whatever?) -- but maybe admins would still prefer to
>> use dm-multipath for the NVMe over FC.
>
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)
It seems like this isn't what Mike wanted, but I actually got some
requests for limited support for that to do a storage live migration
from a SCSI array to NVMe. I think it is really sketchy, but doable
if you are careful enough. It would use dm-multipath, possibly
even on top of nvme multipathing if we have multiple nvme paths.
On Thu, May 31 2018 at 12:33pm -0400,
Christoph Hellwig <[email protected]> wrote:
> On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:
> > Because once nvme_core.multipath=N is set: native NVMe multipath is then
> > not accessible from the same host. The goal of this patchset is to give
> > users choice. But not limit them to _only_ using dm-multipath if they
> > just have some legacy needs.
>
> Choice by itself really isn't an argument. We need a really good
> use case for all the complexity, and so far none has been presented.
OK, but it's a choice that is governed by higher-level requirements that _I_
personally don't have. They are large datacenter deployments like
Hannes alluded to [1] where there is heavy automation and/or layered
products that are developed around dm-multipath (via libraries to access
multipath-tools provided info, etc).
So trying to pin me down on _why_ users elect to make this choice (or
that there is such annoying inertia behind their choice) really isn't
fair TBH. They exist. Please just accept that.
Now another hypothetical usecase I thought of today, one that really
drives home _why_ it is useful to have this fine-grained
'mpath_personality' flexibility, is: admin containers. (Not saying
people or companies currently do this, or plan to, but they very easily
could...):
1) container A is tasked with managing some dedicated NVMe technology
that absolutely needs native NVMe multipath.
2) container B is tasked with offering some canned layered product that
was developed ontop of dm-multipath with its own multipath-tools
oriented APIs, etc. And it is to manage some other NVMe technology on
the same host as container A.
So, containers with conflicting requirements running on the same host.
Now you can say: sorry don't do that. But that really isn't a valid
counter.
Point is it really is meaningful to offer this 'mpath_personality'
switch. I'm obviously hopeful for it to not be heavily used BUT not
providing the ability for native NVMe multipath and dm-multipath to
coexist on the same Linux host really isn't viable in the near-term.
Mike
[1] https://lkml.org/lkml/2018/5/29/95
Mike,
> 1) container A is tasked with managing some dedicated NVMe technology
> that absolutely needs native NVMe multipath.
> 2) container B is tasked with offering some canned layered product
> that was developed ontop of dm-multipath with its own multipath-tools
> oriented APIs, etc. And it is to manage some other NVMe technology on
> the same host as container A.
This assumes there is something to manage. And that the administrative
model currently employed by DM multipath will be easily applicable to
ANA devices. I don't believe that's the case. The configuration happens
on the storage side, not on the host.
With ALUA (and the proprietary implementations that predated the spec),
it was very fuzzy whether it was the host or the target that owned
responsibility for this or that. Part of the reason was that ALUA was
deliberately vague to accommodate everybody's existing, non-standards
compliant multipath storage implementations.
With ANA the heavy burden falls entirely on the storage. Most of the
things you would currently configure in multipath.conf have no meaning
in the context of ANA. Things that are currently the domain of
dm-multipath or multipathd are inextricably living either in the storage
device or in the NVMe ANA "device handler". And I think you are
significantly underestimating the effort required to expose that
information up the stack and to make use of it. That's not just a
multipath personality toggle switch.
If you want to make multipath -ll show something meaningful for ANA
devices, then by all means go ahead. I don't have any problem with
that. But I don't think the burden of allowing multipathd/DM to inject
themselves into the path transition state machine has any benefit
whatsoever to the user. It's only complicating things and therefore we'd
be doing people a disservice rather than a favor.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, May 31 2018 at 12:34pm -0400,
Christoph Hellwig <[email protected]> wrote:
> On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:
> > I saw your reply to the 1/3 patch.. I do agree it is broken for not
> > checking if any handles are active. But that is easily fixed no?
>
> Doing a switch at runtime simply is a really bad idea. If for some
> reason we end up with a good per-controller switch it would have
> to be something set at probe time, and to get it on a controller
> you'd need to reset it first.
Yes, I see that now. And the implementation would need to be something
you or other more seasoned NVMe developers pursue. NVMe code is
pretty unforgiving.
I took a crack at aspects of this; my head hurts. While testing I hit
some "interesting" lack of self-awareness about NVMe resources that are
in use, so lots of associations can be torn down rather than failing
gracefully. Could be nvme_fcloop specific, but it is pretty easy
to do the following using mptest's lib/unittests/nvme_4port_create.sh
followed by: modprobe -r nvme_fcloop
Results in an infinite spew of:
[14245.345759] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.351851] nvme_fcloop: fcloop_exit: Failed deleting target port
[14245.357944] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.364038] nvme_fcloop: fcloop_exit: Failed deleting target port
Another fun one is to run lib/unittests/nvme_4port_delete.sh while the
native NVMe multipath device (created from nvme_4port_create.sh) is
still in use by an xfs mount, so:
./nvme_4port_create.sh
mount /dev/nvme1n1 /mnt
./nvme_4port_delete.sh
umount /mnt
Those were clear screwups on my part but I wouldn't have expected them
to cause nvme to blow through so many stop signs.
Anyway, I put enough time into trying to make the previously thought
"simple" mpath_personality switch safe -- in the face of active handles
(the issue Sagi pointed out) -- that it is clear NVMe just doesn't have
enough state to do it in a clean way. It would require a deeper
understanding of the code that I don't have. Most every NVMe function
returns void so there is basically no potential for error handling (in
the face of a resource being in use).
The following is my WIP patch (built ontop of the 3 patches from
this thread's series) that has cured me of wanting to continue pursuit
of a robust implementation of the runtime 'mpath_personality' switch:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e018d0..80103b3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2146,10 +2146,8 @@ static ssize_t __nvme_subsys_store_mpath_personality(struct nvme_subsystem *subs
goto out;
}
- if (subsys->native_mpath != native_mpath) {
- subsys->native_mpath = native_mpath;
- ret = nvme_mpath_change_personality(subsys);
- }
+ if (subsys->native_mpath != native_mpath)
+ ret = nvme_mpath_change_personality(subsys, native_mpath);
out:
return ret ? ret : count;
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 53d2610..017c924 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -247,26 +247,57 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
put_disk(head->disk);
}
-int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
+static bool __nvme_subsys_in_use(struct nvme_subsystem *subsys)
{
struct nvme_ctrl *ctrl;
- int ret = 0;
+ struct nvme_ns *ns, *next;
-restart:
- mutex_lock(&subsys->lock);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
- if (!list_empty(&ctrl->namespaces)) {
- mutex_unlock(&subsys->lock);
- nvme_remove_namespaces(ctrl);
- goto restart;
+ down_write(&ctrl->namespaces_rwsem);
+ list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+ if ((kref_read(&ns->kref) > 1) ||
+ // FIXME: need to compare with N paths
+ (ns->head && (kref_read(&ns->head->ref) > 1))) {
+ printk("ns->kref = %d", kref_read(&ns->kref));
+ printk("ns->head->ref = %d", kref_read(&ns->head->ref));
+ up_write(&ctrl->namespaces_rwsem);
+ mutex_unlock(&subsys->lock);
+ return true;
+ }
}
+ up_write(&ctrl->namespaces_rwsem);
}
- mutex_unlock(&subsys->lock);
+
+ return false;
+}
+
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys, bool native)
+{
+ struct nvme_ctrl *ctrl;
mutex_lock(&subsys->lock);
- list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
- nvme_queue_scan(ctrl);
+
+ if (__nvme_subsys_in_use(subsys)) {
+ mutex_unlock(&subsys->lock);
+ return -EBUSY;
+ }
+
+ // FIXME: racey, subsys could now be in use here.
+ // Interlock against use needs work from an NVMe developer (hch?) :)
+
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ cancel_work_sync(&ctrl->reset_work);
+ flush_work(&ctrl->reset_work);
+ nvme_stop_ctrl(ctrl);
+ }
+
+ subsys->native_mpath = native;
mutex_unlock(&subsys->lock);
- return ret;
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ nvme_remove_namespaces(ctrl);
+ nvme_start_ctrl(ctrl);
+ }
+
+ return 0;
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 81e4e71..97a6b08 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -452,7 +452,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
-int nvme_mpath_change_personality(struct nvme_subsystem *subsys);
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys, bool native);
static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
On Thu, May 31 2018 at 10:40pm -0400,
Martin K. Petersen <[email protected]> wrote:
>
> Mike,
>
> > 1) container A is tasked with managing some dedicated NVMe technology
> > that absolutely needs native NVMe multipath.
>
> > 2) container B is tasked with offering some canned layered product
> > that was developed ontop of dm-multipath with its own multipath-tools
> > oriented APIs, etc. And it is to manage some other NVMe technology on
> > the same host as container A.
>
> This assumes there is something to manage. And that the administrative
> model currently employed by DM multipath will be easily applicable to
> ANA devices. I don't believe that's the case. The configuration happens
> on the storage side, not on the host.
Fair point.
> With ALUA (and the proprietary implementations that predated the spec),
> it was very fuzzy whether it was the host or the target that owned
> responsibility for this or that. Part of the reason was that ALUA was
> deliberately vague to accommodate everybody's existing, non-standards
> compliant multipath storage implementations.
>
> With ANA the heavy burden falls entirely on the storage. Most of the
> things you would currently configure in multipath.conf have no meaning
> in the context of ANA. Things that are currently the domain of
> dm-multipath or multipathd are inextricably living either in the storage
> device or in the NVMe ANA "device handler". And I think you are
> significantly underestimating the effort required to expose that
> information up the stack and to make use of it. That's not just a
> multipath personality toggle switch.
I'm aware that most everything in multipath.conf is SCSI/FC specific.
That isn't the point. dm-multipath and multipathd are an existing
framework for managing multipath storage.
It could be made to work with NVMe. But yes it would not be easy.
Especially not with the native NVMe multipath crew being so damn
hostile.
> If you want to make multipath -ll show something meaningful for ANA
> devices, then by all means go ahead. I don't have any problem with
> that.
Thanks so much for your permission ;) But I'm actually not very
involved with multipathd development anyway. It is likely a better use
of time in the near-term though. Making the multipath tools and
libraries able to understand native NVMe multipath in all its glory
might be a means to an end, from the perspective of compatibility with
existing monitoring applications.
Though NVMe just doesn't have per-device accounting at all. Also not
yet aware how nvme cli conveys paths being down vs up, etc.
Glad that isn't my problem ;)
> But I don't think the burden of allowing multipathd/DM to inject
> themselves into the path transition state machine has any benefit
> whatsoever to the user. It's only complicating things and therefore we'd
> be doing people a disservice rather than a favor.
This notion that only native NVMe multipath can be successful is utter
bullshit. And the mere fact that I've gotten such a reaction from a
select few speaks to some serious control issues.
Imagine if XFS developers just one day imposed that it is the _only_
filesystem that can be used on persistent memory.
Just please dial it back.. seriously tiresome.
Good morning Mike,
> This notion that only native NVMe multipath can be successful is utter
> bullshit. And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.
Please stop making this personal.
> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.
It's not about project X vs. project Y at all. This is about how we got
to where we are today. And whether we are making right decisions that
will benefit our users in the long run.
20 years ago there were several device-specific SCSI multipath drivers
available for Linux. All of them out-of-tree because there was no good
way to consolidate them. They all worked in very different ways because
the devices themselves were implemented in very different ways. It was a
nightmare.
At the time we were very proud of our block layer, an abstraction none
of the other operating systems really had. And along came Ingo and
Miguel and did a PoC MD multipath implementation for devices that didn't
have special needs. It was small, beautiful, and fit well into our shiny
block layer abstraction. And therefore everyone working on Linux storage
at the time was convinced that the block layer multipath model was the
right way to go. Including, I must emphasize, yours truly.
There were several reasons why the block + userland model was especially
compelling:
1. There were no device serial numbers, UUIDs, or VPD pages. So short
of disk labels, there was no way to automatically establish that block
device sda was in fact the same LUN as sdb. MD and DM were existing
vehicles for describing block device relationships. Either via on-disk
metadata or config files and device mapper tables. And system
configurations were simple and static enough then that manually
maintaining a config file wasn't much of a burden.
2. There was lots of talk in the industry about devices supporting
heterogeneous multipathing. As in ATA on one port and SCSI on the
other. So we deliberately did not want to put multipathing in SCSI,
anticipating that these hybrid devices might show up (this was in the
IDE days, obviously, predating libata sitting under SCSI). We made
several design compromises wrt. SCSI devices to accommodate future
coexistence with ATA. Then iSCSI came along and provided a "cheaper
than FC" solution and everybody instantly lost interest in ATA
multipath.
3. The devices at the time needed all sorts of custom knobs to
function. Path checkers, load balancing algorithms, explicit failover,
etc. We needed a way to run arbitrary, potentially proprietary,
commands to initiate failover and failback. Absolute no-go for the
kernel so userland it was.
Those are some of the considerations that went into the original MD/DM
multipath approach. Everything made lots of sense at the time. But
obviously the industry constantly changes, things that were once
important no longer matter. Some design decisions were made based on
incorrect assumptions or lack of experience and we ended up with major
ad-hoc workarounds to the originally envisioned approach. SCSI device
handlers are the prime examples of how the original transport-agnostic
model didn't quite cut it. Anyway. So here we are. Current DM multipath
is a result of a whole string of design decisions, many of which are
based on assumptions that were valid at the time but which are no longer
relevant today.
ALUA came along in an attempt to standardize all the proprietary device
interactions, thus obsoleting the userland plugin requirement. It also
solved the ID/discovery aspect as well as provided a way to express
fault domains. The main problem with ALUA was that it was too
permissive, letting storage vendors get away with very suboptimal, yet
compliant, implementations based on their older, proprietary multipath
architectures. So we got the knobs standardized, but device behavior was
still all over the place.
Now enter NVMe. The industry had a chance to clean things up. No legacy
architectures to accommodate, no need for explicit failover, twiddling
mode pages, reading sector 0, etc. The rationale behind ANA is for
multipathing to work without any of the explicit configuration and
management hassles which riddle SCSI devices for hysterical raisins.
My objection to DM vs. NVMe enablement is that I think that the two
models are a very poor fit (manually configured individual block device
mapping vs. automatic grouping/failover above and below subsystem
level). On top of that, no compelling technical reason has been offered
for why DM multipath is actually a benefit. Nobody enjoys pasting WWNs
or IQNs into multipath.conf to get things working. And there is no flag
day/transition path requirement for devices that (with very few
exceptions) don't actually exist yet.
So I really don't understand why we must pound a square peg into a round
hole. NVMe is a different protocol. It is based on several decades of
storage vendor experience delivering products. And the protocol tries to
avoid the most annoying pitfalls and deficiencies from the SCSI past. DM
multipath made a ton of sense when it was conceived, and it continues to
serve its purpose well for many classes of devices. That does not
automatically imply that it is an appropriate model for *all* types of
devices, now and in the future. ANA is a deliberate industry departure
from the pre-ALUA SCSI universe that begat DM multipath.
So let's have a rational, technical discussion about what the use cases
are that would require deviating from the "hands off" aspect of ANA.
What is it DM can offer that isn't or can't be handled by the ANA code
in NVMe? What is it that must go against the grain of what the storage
vendors are trying to achieve with ANA?
--
Martin K. Petersen Oracle Linux Engineering
On Fri, Jun 01 2018 at 10:09am -0400,
Martin K. Petersen <[email protected]> wrote:
>
> Good morning Mike,
>
> > This notion that only native NVMe multipath can be successful is utter
> > bullshit. And the mere fact that I've gotten such a reaction from a
> > select few speaks to some serious control issues.
>
> Please stop making this personal.
It cuts both ways, but I agree.
> > Imagine if XFS developers just one day imposed that it is the _only_
> > filesystem that can be used on persistent memory.
>
> It's not about project X vs. project Y at all. This is about how we got
> to where we are today. And whether we are making right decisions that
> will benefit our users in the long run.
>
> 20 years ago there were several device-specific SCSI multipath drivers
> available for Linux. All of them out-of-tree because there was no good
> way to consolidate them. They all worked in very different ways because
> the devices themselves were implemented in very different ways. It was a
> nightmare.
>
> At the time we were very proud of our block layer, an abstraction none
> of the other operating systems really had. And along came Ingo and
> Miguel and did a PoC MD multipath implementation for devices that didn't
> have special needs. It was small, beautiful, and fit well into our shiny
> block layer abstraction. And therefore everyone working on Linux storage
> at the time was convinced that the block layer multipath model was the
> right way to go. Including, I must emphasize, yours truly.
>
> There were several reasons why the block + userland model was especially
> compelling:
>
> 1. There were no device serial numbers, UUIDs, or VPD pages. So short
> of disk labels, there was no way to automatically establish that block
> device sda was in fact the same LUN as sdb. MD and DM were existing
> vehicles for describing block device relationships. Either via on-disk
> metadata or config files and device mapper tables. And system
> configurations were simple and static enough then that manually
> maintaining a config file wasn't much of a burden.
>
> 2. There was lots of talk in the industry about devices supporting
> heterogeneous multipathing. As in ATA on one port and SCSI on the
> other. So we deliberately did not want to put multipathing in SCSI,
> anticipating that these hybrid devices might show up (this was in the
> IDE days, obviously, predating libata sitting under SCSI). We made
> several design compromises wrt. SCSI devices to accommodate future
> coexistence with ATA. Then iSCSI came along and provided a "cheaper
> than FC" solution and everybody instantly lost interest in ATA
> multipath.
>
> 3. The devices at the time needed all sorts of custom knobs to
> function. Path checkers, load balancing algorithms, explicit failover,
> etc. We needed a way to run arbitrary, potentially proprietary,
> commands to initiate failover and failback. Absolute no-go for the
> kernel so userland it was.
>
> Those are some of the considerations that went into the original MD/DM
> multipath approach. Everything made lots of sense at the time. But
> obviously the industry constantly changes, things that were once
> important no longer matter. Some design decisions were made based on
> incorrect assumptions or lack of experience and we ended up with major
> ad-hoc workarounds to the originally envisioned approach. SCSI device
> handlers are the prime examples of how the original transport-agnostic
> model didn't quite cut it. Anyway. So here we are. Current DM multipath
> is a result of a whole string of design decisions, many of which are
> based on assumptions that were valid at the time but which are no longer
> relevant today.
>
> ALUA came along in an attempt to standardize all the proprietary device
> interactions, thus obsoleting the userland plugin requirement. It also
> solved the ID/discovery aspect as well as provided a way to express
> fault domains. The main problem with ALUA was that it was too
> permissive, letting storage vendors get away with very suboptimal, yet
> compliant, implementations based on their older, proprietary multipath
> architectures. So we got the knobs standardized, but device behavior was
> still all over the place.
>
> Now enter NVMe. The industry had a chance to clean things up. No legacy
> architectures to accommodate, no need for explicit failover, twiddling
> mode pages, reading sector 0, etc. The rationale behind ANA is for
> multipathing to work without any of the explicit configuration and
> management hassles which riddle SCSI devices for hysterical raisins.
Nice recap for those who aren't aware of the past (decision tree and
considerations that influenced the design of DM multipath).
> My objection to DM vs. NVMe enablement is that I think that the two
> models are a very poor fit (manually configured individual block device
> mapping vs. automatic grouping/failover above and below subsystem
> level). On top of that, no compelling technical reason has been offered
> for why DM multipath is actually a benefit. Nobody enjoys pasting WWNs
> or IQNs into multipath.conf to get things working. And there is no flag
> day/transition path requirement for devices that (with very few
> exceptions) don't actually exist yet.
>
> So I really don't understand why we must pound a square peg into a round
> hole. NVMe is a different protocol. It is based on several decades of
> storage vendor experience delivering products. And the protocol tries to
> avoid the most annoying pitfalls and deficiencies from the SCSI past. DM
> multipath made a ton of sense when it was conceived, and it continues to
> serve its purpose well for many classes of devices. That does not
> automatically imply that it is an appropriate model for *all* types of
> devices, now and in the future. ANA is a deliberate industry departure
> from the pre-ALUA SCSI universe that begat DM multipath.
>
> So let's have a rational, technical discussion about what the use cases
> are that would require deviating from the "hands off" aspect of ANA.
> What is it DM can offer that isn't or can't be handled by the ANA code
> in NVMe? What is it that must go against the grain of what the storage
> vendors are trying to achieve with ANA?
Really it boils down to how do users pivot to making use of native NVMe
multipath? By "pivot" I mean these users have multipath experience.
They have dealt with all the multipath.conf and dm-multipath quirks.
They know how to diagnose and monitor with these tools. They have their
own scripts and automation to manage the complexity. In addition, the
dm-multipath model of consuming other linux block devices implies users
have full visibility into IO performance across the entire dm-multipath
stack.
So the biggest failing for native NVMe multipath at this moment: there
is no higher-level equivalent API for multipath state and performance
monitoring. And I'm not faulting anyone on the NVMe side for this. I
know how software development works. The fundamentals need to be
developed before the luxury of higher-level APIs and tools development
can make progress.
That said, I think we _do_ need to have a conversation about the current
capabilities of NVMe (and nvme cli) relative to piercing through the
toplevel native NVMe multipath device to really allow a user to "trust
but verify" all is behaving as it should.
So, how do/will native NVMe users:
1) know that a path is down/up (or even a larger subset of the fabric)?
 - coupling this info with topology graphs is useful (a rough sysfs
   sketch follows this list)
2) know the performance of each disparate path (with no path selectors
at the moment it is moot, but it will become an issue)
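On point 1, each controller (i.e. each path) already exposes some state
in sysfs, so a crude check is possible even without a recent nvme-cli.
A quick userspace sketch, not part of this series; the 'state',
'transport', 'address' and 'subsysnqn' attributes are assumed to be
present, and some may be absent depending on transport:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void print_attr(const char *ctrl, const char *attr)
{
	char path[256], buf[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return;		/* attribute not exposed for this transport */
	if (fgets(buf, sizeof(buf), f)) {
		buf[strcspn(buf, "\n")] = '\0';
		printf("  %-10s %s\n", attr, buf);
	}
	fclose(f);
}

int main(void)
{
	DIR *d = opendir("/sys/class/nvme");
	struct dirent *de;

	if (!d) {
		perror("/sys/class/nvme");
		return 1;
	}
	while ((de = readdir(d))) {
		if (strncmp(de->d_name, "nvme", 4))
			continue;
		printf("%s:\n", de->d_name);
		print_attr(de->d_name, "state");	/* live, resetting, ... */
		print_attr(de->d_name, "transport");	/* pcie, fc, rdma, loop */
		print_attr(de->d_name, "address");
		print_attr(de->d_name, "subsysnqn");	/* group paths by subsystem */
	}
	closedir(d);
	return 0;
}

Something like this (or its equivalent inside multipath-tools) could
feed per-path up/down state into existing monitoring, but it is still a
far cry from the dm-multipath interfaces people have automation built
around.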
It is tough to know the end from the beginning. And I think you and
others would agree we're basically still in native NVMe multipath's
beginning (might not feel like it given all the hard work that has been
done with the NVMe TWG, etc). So given things are still so "green" I'd
imagine you can easily see why distro vendors like Red Hat and SUSE are
looking at this and saying "welp, native NVMe multipath isn't ready,
what are we going to do?".
And given there is so much vendor and customer expertise with
dm-multipath you can probably also see why a logical solution is to
try to enable NVMe multipath _with_ ANA in terms of dm-multipath... to
help us maintain interfaces customers have come to expect.
So dm-multipath is thought as a stop-gap to allow users to use existing
toolchains and APIs (which native NVMe multipath is completely lacking).
I get why that pains Christoph, yourself and others. I'm not liking it
either believe me!
Mike
> I'm aware that most everything in multipath.conf is SCSI/FC specific.
> That isn't the point. dm-multipath and multipathd are an existing
> framework for managing multipath storage.
>
> It could be made to work with NVMe. But yes it would not be easy.
> Especially not with the native NVMe multipath crew being so damn
> hostile.
The resistance is not a hostile act. Please try and keep the
discussion technical.
>> But I don't think the burden of allowing multipathd/DM to inject
>> themselves into the path transition state machine has any benefit
>> whatsoever to the user. It's only complicating things and therefore we'd
>> be doing people a disservice rather than a favor.
>
> This notion that only native NVMe multipath can be successful is utter
> bullshit. And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.
>
> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.
>
> Just please dial it back.. seriously tiresome.
Mike, you make a fair point on multipath tools being more mature
compared to NVMe multipathing. But this is not the discussion at all (at
least not from my perspective). There was not a single use-case that
gave a clear-cut justification for a per-subsystem personality switch
(other than some far-fetched imaginary scenarios). It is not unusual
for the kernel community to not accept things that have little to no use,
especially when they involve exposing a userspace ABI.
As for now, all I see is a disclaimer saying that it'd need to be
nurtured over time as the NVMe spec evolves.
Can you (or others) please try and articulate why a "fine grained"
multipathing is an absolute must? At the moment, I just don't
understand.
Also, I get your point that exposing state/stats information to
userspace is needed. That's a fair comment.
On Sun, Jun 03 2018 at 7:00P -0400,
Sagi Grimberg <[email protected]> wrote:
>
> >I'm aware that most everything in multipath.conf is SCSI/FC specific.
> >That isn't the point. dm-multipath and multipathd are an existing
> >framework for managing multipath storage.
> >
> >It could be made to work with NVMe. But yes it would not be easy.
> >Especially not with the native NVMe multipath crew being so damn
> >hostile.
>
> The resistance is not a hostile act. Please try and keep the
> discussion technical.
This projecting onto me that I've not been keeping the conversation
technical is in itself hostile. Sure I get frustrated and lash out (as
I'm _sure_ you'll feel in this reply) but I've been beating my head
against the wall on the need for native NVMe multipath and dm-multipath
to coexist in a fine-grained manner for literally 2 years!
But for the time-being I was done dwelling on the need for a switch like
mpath_personality. Yet you persist. If you read the latest messages in
this thread [1] and still elected to send this message, then _that_ is a
hostile act. Because I have been nothing but informative. The fact you
choose not to care, appreciate or have concern for users' experience
isn't my fault.
And please don't pretend like the entire evolution of native NVMe
multipath was anything but one elaborate hostile act against
dm-multipath. To deny that would simply discredit your entire
viewpoint on this topic.
Even smaller decisions that were communicated in person and then later
unilaterally reversed were hostile. Examples:
1) ANA would serve as a scsi-device-handler-like (multipath agnostic)
feature to enhance namespaces -- now you can see in the v2
implementation that certainly isn't the case
2) The dm-multipath path-selectors were going to be elevated for use by
both native NVMe multipath and dm-multipath -- now people are
implementing yet another round-robin path selector directly in NVMe.
I get it, Christoph (and others by association) are operating from a
"winning" position that was hostiley taken and now the winning position
is being leveraged to further ensure dm-multipath has no hope of being a
viable alternative to native NVMe multipath -- at least not without a
lot of work to refactor code to be unnecessarily homed in the
CONFIG_NVME_MULTIPATH=y sandbox.
> >>But I don't think the burden of allowing multipathd/DM to inject
> >>themselves into the path transition state machine has any benefit
> >>whatsoever to the user. It's only complicating things and therefore we'd
> >>be doing people a disservice rather than a favor.
> >
> >This notion that only native NVMe multipath can be successful is utter
> >bullshit. And the mere fact that I've gotten such a reaction from a
> >select few speaks to some serious control issues.
> >
> >Imagine if XFS developers just one day imposed that it is the _only_
> >filesystem that can be used on persistent memory.
> >
> >Just please dial it back.. seriously tiresome.
>
> Mike, you make a fair point on multipath tools being more mature
> compared to NVMe multipathing. But this is not the discussion at all (at
> least not from my perspective). There was not a single use-case that
> gave a clear-cut justification for a per-subsystem personality switch
> (other than some far-fetched imaginary scenarios). It is not unusual
> for the kernel community to not accept things that have little to no use,
> especially when they involve exposing a userspace ABI.
The interfaces dm-multipath and multipath-tools provide are exactly the
issue. So which is it: do I have a valid usecase, like you indicated
before [2], or am I just talking nonsense (with hypotheticals because I
was baited to do so)? NOTE: even in your [2] reply you also go on to
say that "no one is forbidden to use [dm-]multipath", when the reality
is that, as-is, users will be.
If you and others genuinely think that disallowing dm-multipath from
being able to manage NVMe devices if CONFIG_NVME_MULTIPATH is enabled
(and not shutoff via nvme_core.multipath=N) is a reasonable action then
you're actively complicit in limiting users from continuing to use the
long-established dm-multipath based infrastructure that Linux has had
for over 10 years.
There is literally no reason why they need to be mutually exclusive
(other than that granting otherwise would erode the "winning" position
hch et al have been operating from).
The implementation of the switch to allow fine-grained control does need
proper care and review and buy-in. But I'm sad to see there literally
is zero willingness to even acknowledge that it is "the right thing to
do".
> As for now, all I see is a disclaimer saying that it'd need to be
> nurtured over time as the NVMe spec evolves.
>
> Can you (or others) please try and articulate why a "fine grained"
> multipathing is an absolute must? At the moment, I just don't
> understand.
Already made the point multiple times in this thread [3][4][5][1].
Hint: it is about the users who have long-standing expertise and
automation built around dm-multipath and multipath-tools. BUT those
same users may need/want to simultaneously use native NVMe multipath on
the same host. Dismissing this point or acting like I haven't
articulated it just illustrates to me continuing this conversation is
not going to be fruitful.
Mike
[1] https://lkml.org/lkml/2018/6/1/562
[2] https://lkml.org/lkml/2018/5/31/175
[3] https://lkml.org/lkml/2018/5/29/230
[4] https://lkml.org/lkml/2018/5/29/1260
[5] https://lkml.org/lkml/2018/5/31/707
On Wed, 30 May 2018 13:05:46 -0600
Jens Axboe <[email protected]> wrote:
> On 5/29/18 5:27 PM, Mike Snitzer wrote:
> > On Tue, May 29 2018 at 4:09am -0400,
> > Christoph Hellwig <[email protected]> wrote:
> >
> >> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn
> >> wrote:
> >>> For a "Plan B" we can still use the global knob that's already in
> >>> place (even if this reminds me so much about scsi-mq which at
> >>> least we haven't turned on in fear of performance regressions).
> >>>
> >>> Let's drop the discussion here, I don't think it leads to
> >>> something else than flamewars.
> >>
> >> If our plan A doesn't work we can go back to these patches. For
> >> now I'd rather have everyone spend their time on making Plan A
> >> work then preparing for contingencies. Nothing prevents anyone
> >> from using these patches already out there if they really want to,
> >> but I'd recommend people are very careful about doing so as you'll
> >> lock yourself into a long-term maintainance burden.
> >
> > Restating (for others): this patchset really isn't about
> > contingencies. It is about choice.
> >
> > Since we're at an impasse, in the hopes of soliciting definitive
> > feedback from Jens and Linus, I'm going to attempt to reset the
> > discussion for their entry.
> >
> > In summary, we have a classic example of a maintainer stalemate
> > here: 1) Christoph, as NVMe co-maintainer, doesn't want to allow
> > native NVMe multipath to actively coexist with dm-multipath's NVMe
> > support on the same host.
> > 2) I, as DM maintainer, would like to offer this flexibility to
> > users -- by giving them opt-in choice to continue using existing
> > dm-multipath with NVMe. (also, both Red Hat and SUSE would like to
> > offer this).
> >
> > There is no technical reason why they cannot coexist. Hence this
> > simple patchset that was originally offered by Johannes Thumshirn
> > with contributions from myself.
>
> Here's what I think - flag days tend to suck. They may be more
> convenient for developers, but they inflict pain on users. Sometimes
> they prevent them from moving forward, since updates are now gated on
> external dependencies. Moving forward with a new architecture is
> great, but proper care has to be given to existing users of
> multipath, regardless of how few they may be.
>
> This patchset seems pretty clean and minimalist. Realistically, I'm
> guessing that SUSE and RH will ship it regardless of upstream status.
>
Without it we're having a choice of disappointing (paying) customers or
disappointing the upstream community.
Guess.
Cheers,
Hannes
On Mon, Jun 04, 2018 at 08:19:21AM +0200, Hannes Reinecke wrote:
> Without it we're having a choice of disappointing (paying) customers or
> disappointing the upstream community.
I personally think (regardless of the fact that I wrote the patch
under discussion) that using the module parameter is sufficient for
these kinds of customers. For me it's an either/or kind of setting (either
native or dm-mpath).
Downstream distributions could still carry a small patch flipping the
default to off if they want to maintain backwards compatibility with
existing dm-mpath setups (of which, for NVMe, I doubt there are many!).
What we really should do is, try to give multipath-tools a 'nvme
list-subsys' like view of nvme native multipathing (and I think Martin
W. has already been looking into this a while ago).
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[so much for putting out flames... :/]
> This projecting onto me that I've not been keeping the conversation
> technical is in itself hostile. Sure I get frustrated and lash out (as
> I'm _sure_ you'll feel in this reply)
You're right, I do feel this is lashing out. And I don't appreciate it.
Please stop it. We're not going to make progress otherwise.
>> Can you (or others) please try and articulate why a "fine grained"
>> multipathing is an absolute must? At the moment, I just don't
>> understand.
>
> Already made the point multiple times in this thread [3][4][5][1].
> Hint: it is about the users who have long-standing expertise and
> automation built around dm-multipath and multipath-tools. BUT those
> same users may need/want to simultaneously use native NVMe multipath on
> the same host. Dismissing this point or acting like I haven't
> articulated it just illustrates to me continuing this conversation is
> not going to be fruitful.
The vast majority of the points are about the fact that people still
need to be able to use multipath-tools, which they still can today.
Personally, I question the existence of this user base you are referring
to which would want to maintain both dm-multipath and nvme personalities
at the same time on the same host. But I do want us to make progress, so
I will have to take this need as a given.
I agree with Christoph that changing personality on the fly is going to
be painful. This opt-in will need to happen at connect time. For
that, we will probably also need to expose an argument in nvme-cli.
Changing the mpath personality will need to involve disconnecting the
controller and connecting again with the argument toggled. I think this
is the only sane way to do this.
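Just to make that concrete, a connect-time opt-out could be wired into
the fabrics option parser roughly like below; the NVMF_OPT_NO_MPATH
token, the "no_mpath" string and the opts->no_mpath field are all
hypothetical names, nothing like this exists in the posted series:

/* drivers/nvme/host/fabrics.c, rough sketch with hypothetical names */
static const match_table_t opt_tokens = {
	/* ... existing tokens ... */
	{ NVMF_OPT_NO_MPATH,	"no_mpath"	},
	{ NVMF_OPT_ERR,		NULL		}
};

/* in nvmf_parse_options(), inside the match_token() switch: */
case NVMF_OPT_NO_MPATH:
	opts->no_mpath = true;	/* leave this subsystem to dm-multipath */
	break;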
Another path we can make progress in is user visibility. We have
topology in place and you mentioned primary path (which we could
probably add). What else do you need for multipath-tools to support
nvme?
On Mon, Jun 04, 2018 at 02:46:47PM +0300, Sagi Grimberg wrote:
> I agree with Christoph that changing personality on the fly is going to
> be painful. This opt-in will need to be one-shot at connect time. For
> that, we will probably need to expose an argument in nvme-cli as well.
> Changing the mpath personality will need to involve disconnecting the
> controller and connecting again with the argument toggled. I think this
> is the only sane way to do this.
If we still want to make it dynamic, yes. I've raised this concern
while working on the patch as well.
> Another path we can make progress in is user visibility. We have
> topology in place and you mentioned primary path (which we could
> probably add). What else do you need for multipath-tools to support
> nvme?
I think the first priority is getting a notion of nvme into
multipath-tools, as I said elsewhere, and then seeing where that takes
us. Martin Wilck was already working on patches for this.
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
On Mon, Jun 04, 2018 at 09:18:29AM +0200, Johannes Thumshirn wrote:
> What we really should do is try to give multipath-tools an 'nvme
> list-subsys'-like view of nvme native multipathing (and I think Martin
> W. has already been looking into this for a while).
Which was merged into multipath-tools a while ago:
https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commit;h=86553b57b6bd55e0355ac27ae100cce6cc42bee3
On Mon, Jun 04 2018 at 8:59am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Mon, Jun 04, 2018 at 09:18:29AM +0200, Johannes Thumshirn wrote:
> > What we really should do is try to give multipath-tools an 'nvme
> > list-subsys'-like view of nvme native multipathing (and I think Martin
> > W. has already been looking into this for a while).
>
> Which was merged into multipath-tools a while ago:
>
> https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commit;h=86553b57b6bd55e0355ac27ae100cce6cc42bee3
And this is what I heard from Ben Marzinski last week:
Yeah. Things like
multipath -l
multipathd show maps
multipathd show paths
are supported. There is no support for the individual "show map" and
"show path" commands. Those only work on dm devices. There is also no
json formatting option for the foreign devices, but that could be added.
And probably will need to be, since people like RHEV really want to
use the library interface to multipathd with json formatted output.
Although if there are no dm multipath devices, there is no point to
running multipathd, so it might be worthwhile thinking about a library
interface to getting the information directly, instead of through
multipathd. That's a bigger rewrite.
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> keeps nvme users on dm-multipath will probably not help them
> educate their customers either... So there is another angle to this.
As a vendor who is building an NVMe-oF storage array, I can say that
clarity around how Linux wants to handle NVMe multipath would
definitely be appreciated. It would be great if we could all converge
around the upstream native driver but right now it doesn't look
adequate - having only a single active path is not the best way to use
a multi-controller storage system. Unfortunately it looks like we're
headed to a world where people have to write separate "best practices"
documents to cover RHEL, SLES and other vendors.
We plan to implement all the fancy NVMe standards like ANA, but it
seems that there is still a requirement to let the host side choose
policies about how to use paths (round-robin vs least queue depth for
example). Even in the modern SCSI world with VPD pages and ALUA,
there are still knobs that are needed. Maybe NVMe will be different
and we can find defaults that work in all cases but I have to admit
I'm skeptical...
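To make the policy question concrete, here is a toy sketch of the two
selectors I mentioned (purely illustrative, not dm-multipath or nvme
code):

struct toy_path {
	unsigned int inflight;	/* outstanding requests on this path */
};

/* round-robin: simply cycle through the paths */
static unsigned int select_round_robin(unsigned int nr_paths)
{
	static unsigned int last;

	return last++ % nr_paths;
}

/* least queue depth: pick the path with the fewest requests in flight */
static unsigned int select_least_qdepth(const struct toy_path *paths,
					unsigned int nr_paths)
{
	unsigned int i, best = 0;

	for (i = 1; i < nr_paths; i++)
		if (paths[i].inflight < paths[best].inflight)
			best = i;
	return best;
}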
- R.
On Mon, Jun 04, 2018 at 02:58:49PM -0700, Roland Dreier wrote:
> We plan to implement all the fancy NVMe standards like ANA, but it
> seems that there is still a requirement to let the host side choose
> policies about how to use paths (round-robin vs least queue depth for
> example). Even in the modern SCSI world with VPD pages and ALUA,
> there are still knobs that are needed. Maybe NVMe will be different
> and we can find defaults that work in all cases but I have to admit
> I'm skeptical...
The sensible thing to do in nvme is to use different paths for
different queues. That is, e.g. in the RDMA case, use the HCA closer
to a given CPU by default. We might allow overriding this for
cases where there is a good reason, but what I really don't want is
configurability for configurability's sake.
> The sensible thing to do in nvme is to use different paths for
> different queues. That is, e.g. in the RDMA case, use the HCA closer
> to a given CPU by default. We might allow overriding this for
> cases where there is a good reason, but what I really don't want is
> configurability for configurability's sake.
That makes sense but I'm not sure it covers everything. Probably the
most common way to do NVMe/RDMA will be with a single HCA that has
multiple ports, so there's no sensible CPU locality. On the other
hand we want to keep both ports to the fabric busy. Setting different
paths for different queues makes sense, but there may be
single-threaded applications that want a different policy.
I'm not saying anything very profound, but we have to find the right
balance between too many and too few knobs.
- R.
>> We plan to implement all the fancy NVMe standards like ANA, but it
>> seems that there is still a requirement to let the host side choose
>> policies about how to use paths (round-robin vs least queue depth for
>> example). Even in the modern SCSI world with VPD pages and ALUA,
>> there are still knobs that are needed. Maybe NVMe will be different
>> and we can find defaults that work in all cases but I have to admit
>> I'm skeptical...
>
> The sensible thing to do in nvme is to use different paths for
> different queues.
Huh? different paths == different controllers so this sentence can't
be right... you mean that a path selector will select a controller
based on the home node of the local rdma device connecting to it and
the running cpu right?
On Wed, Jun 06, 2018 at 12:32:21PM +0300, Sagi Grimberg wrote:
> Huh? different paths == different controllers so this sentence can't
> be right... you mean that a path selector will select a controller
> based on the home node of the local rdma device connecting to it and
> the running cpu right?
Think of a system with, say, 8 cpu cores, and say we have two optimized
paths.
There is no point in doing round robin or service time over the
two paths for each logical per-cpu queue. Instead we should always
go to path A (or path B) for a given cpu queue, to reduce selection
overhead and cache footprint.
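In other words, something as simple as the toy mapping below would do
(sketch only, not actual nvme code): pin each per-cpu queue to one
optimized path up front so no per-I/O path selection is needed at all.

/*
 * Toy sketch: statically map each per-CPU queue to one of the optimized
 * paths.  With 8 queues and 2 paths, queues 0,2,4,6 always use path 0
 * and queues 1,3,5,7 always use path 1.
 */
static inline unsigned int path_for_queue(unsigned int qid,
					  unsigned int nr_paths)
{
	return qid % nr_paths;
}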
On Tue, Jun 05, 2018 at 03:57:05PM -0700, Roland Dreier wrote:
> That makes sense but I'm not sure it covers everything. Probably the
> most common way to do NVMe/RDMA will be with a single HCA that has
> multiple ports, so there's no sensible CPU locality. On the other
> hand we want to keep both ports to the fabric busy. Setting different
> paths for different queues makes sense, but there may be
> single-threaded applications that want a different policy.
>
> I'm not saying anything very profound, but we have to find the right
> balance between too many and too few knobs.
Agreed. And the philosophy here is to start with as few knobs
as possible and work from there based on actual use cases.
Single-threaded applications will run into issues with the general
blk-mq philosophy, so to work around that we'll need to dig deeper
and allow borrowing of other cpus' queues if we want to cater for that.