2024-05-20 20:21:07

by John Meneghini

Subject: [PATCH v3 0/1] nvme: queue-depth multipath iopolicy

Submitting for final review. As agreed at LSFMM I've squashed this series into
one patch and addressed all review comments. Please merge this into nvme-6.10.

Changes since V2:

Add the NVME_MPATH_CNT_ACTIVE flag to eliminate a READ_ONCE in the completion
path, and increment/decrement the nr_active count on all mpath IOs, including
passthru commands.

Send a pr_notice whenever the iopolicy on a subsystem is changed. This is
important for support reasons. It is fully expected that users will be changing
the iopolicy with active IO in progress.
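
For example, switching a subsystem from round-robin to queue-depth logs a
line along these lines (the NQN below is made up, and the exact prefix
depends on the module's pr_fmt):

  nvme_subsys_iopolicy_update: changed from round-robin to queue-depth for subsysnqn nqn.2014-08.org.example:subsys0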

Squashed everything and rebased to nvme-v6.10.

Changes since V1:

I'm re-issuing Ewan's queue-depth patches in preparation for LSFMM.

These patches were first shown at ALPSS 2023, where I shared the following
graphs which measure the IO distribution across 4 active-optimized
controllers using the round-robin versus queue-depth iopolicy.

https://people.redhat.com/jmeneghi/ALPSS_2023/NVMe_QD_Multipathing.pdf

Since that time we have continued testing these patches with a number of
different NVMe-oF storage arrays and test bed configurations, and I've codified
the tests and methods we use to measure IO distribution.

All of my test results, together with the scripts I used to generate these
graphs, are available at:

https://github.com/johnmeneghini/iopolicy

Please use the scripts in this repository to do your own testing.

These patches are based on nvme-v6.9.

Ewan D. Milne (1):
nvme: multipath: Implemented new iopolicy "queue-depth"

drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 9 ++++
3 files changed, 92 insertions(+), 5 deletions(-)

--
2.39.3



2024-05-20 20:21:08

by John Meneghini

Subject: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

From: "Ewan D. Milne" <[email protected]>

The round-robin path selector is inefficient in cases where there is a
difference in latency between multiple active optimized paths. In the
presence of one or more high latency paths the round-robin selector
continues to use the high latency path equally. This results in a bias
towards the highest latency path and can cause a significant decrease
in overall performance as IOs pile on the highest latency path. This
problem is particularly acute with NVMe-oF controllers.

The queue-depth policy instead sends I/O requests down the path with the
fewest requests in its request queue. Paths with lower latency
will clear requests more quickly and have fewer requests in their queues
compared to higher latency paths. The goal of this path selector is to
make more use of lower latency paths, which will bring down overall IO
latency.

Signed-off-by: Ewan D. Milne <[email protected]>
[tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
and compilation warnings, updated MODULE_PARM description, and
fixed potential issue with ->current_path[] being used]
Signed-off-by: Thomas Song <[email protected]>
[jmeneghi: various changes and improvements, addressed review comments]
Signed-off-by: John Meneghini <[email protected]>
Link: https://lore.kernel.org/linux-nvme/[email protected]/
Tested-by: Marco Patalano <[email protected]>
Reviewed-by: Randy Jennings <[email protected]>
Tested-by: Jyoti Rani <[email protected]>
---
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 9 ++++
3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a066429b790d..1dd7c52293ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
EXPORT_SYMBOL_GPL(nvme_delete_wq);

static LIST_HEAD(nvme_subsystems);
-static DEFINE_MUTEX(nvme_subsystems_lock);
+DEFINE_MUTEX(nvme_subsystems_lock);

static DEFINE_IDA(nvme_instance_ida);
static dev_t nvme_ctrl_base_chr_devt;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..0e2b6e720e95 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
};

static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_NUMA;
else if (!strncmp(val, "round-robin", 11))
iopolicy = NVME_IOPOLICY_RR;
+ else if (!strncmp(val, "queue-depth", 11))
+ iopolicy = NVME_IOPOLICY_QD;
else
return -EINVAL;

@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
&iopolicy, 0644);
MODULE_PARM_DESC(iopolicy,
- "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+ "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");

void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
{
@@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
struct nvme_ns *ns = rq->q->queuedata;
struct gendisk *disk = ns->head->disk;

+ if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
+ atomic_inc(&ns->ctrl->nr_active);
+ nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
+ }
+
if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
return;

@@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;

+ if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
+ atomic_dec_if_positive(&ns->ctrl->nr_active);
+
if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
+
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
blk_rq_bytes(rq) >> SECTOR_SHIFT,
nvme_req(rq)->start_time);
@@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
return found;
}

+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+ unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+ unsigned int depth;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ if (nvme_path_is_disabled(ns))
+ continue;
+
+ depth = atomic_read(&ns->ctrl->nr_active);
+
+ switch (ns->ana_state) {
+ case NVME_ANA_OPTIMIZED:
+ if (depth < min_depth_opt) {
+ min_depth_opt = depth;
+ best_opt = ns;
+ }
+ break;
+
+ case NVME_ANA_NONOPTIMIZED:
+ if (depth < min_depth_nonopt) {
+ min_depth_nonopt = depth;
+ best_nonopt = ns;
+ }
+ break;
+ default:
+ break;
+ }
+ }
+
+ return best_opt ? best_opt : best_nonopt;
+}
+
static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
{
return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
@@ -338,15 +384,27 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)

inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
- int node = numa_node_id();
+ int iopolicy = READ_ONCE(head->subsys->iopolicy);
+ int node;
struct nvme_ns *ns;

+ /*
+ * queue-depth iopolicy does not need to reference ->current_path
+ * but round-robin needs the last path used to advance to the
+ * next one, and numa will continue to use the last path unless
+ * it is or has become not optimized
+ */
+ if (iopolicy == NVME_IOPOLICY_QD)
+ return nvme_queue_depth_path(head);
+
+ node = numa_node_id();
ns = srcu_dereference(head->current_path[node], &head->srcu);
if (unlikely(!ns))
return __nvme_find_path(head, node);

- if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
+ if (iopolicy == NVME_IOPOLICY_RR)
return nvme_round_robin_path(head, node, ns);
+
if (unlikely(!nvme_path_is_optimized(ns)))
return __nvme_find_path(head, node);
return ns;
@@ -798,6 +856,25 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
}

+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+ struct nvme_ctrl *ctrl;
+ int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
+ WRITE_ONCE(subsys->iopolicy, iopolicy);
+
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ atomic_set(&ctrl->nr_active, 0);
+ nvme_mpath_clear_ctrl_paths(ctrl);
+ }
+ mutex_unlock(&nvme_subsystems_lock);
+
+ pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+ nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
+ subsys->subnqn);
+}
+
static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
@@ -807,7 +884,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,

for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
- WRITE_ONCE(subsys->iopolicy, i);
+ nvme_subsys_iopolicy_update(subsys, i);
return count;
}
}
@@ -905,6 +982,7 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
mutex_init(&ctrl->ana_lock);
timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
INIT_WORK(&ctrl->ana_work, nvme_ana_work);
+ atomic_set(&ctrl->nr_active, 0);
}

int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f243a5822c2b..f5557889b244 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;

+extern struct mutex nvme_subsystems_lock;
+
/*
* List of workarounds for devices that required behavior not specified in
* the standard.
@@ -190,6 +192,7 @@ enum {
NVME_REQ_CANCELLED = (1 << 0),
NVME_REQ_USERCMD = (1 << 1),
NVME_MPATH_IO_STATS = (1 << 2),
+ NVME_MPATH_CNT_ACTIVE = (1 << 3),
};

static inline struct nvme_request *nvme_req(struct request *req)
@@ -354,6 +357,7 @@ struct nvme_ctrl {
size_t ana_log_size;
struct timer_list anatt_timer;
struct work_struct ana_work;
+ atomic_t nr_active;
#endif

#ifdef CONFIG_NVME_HOST_AUTH
@@ -402,6 +406,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
+ NVME_IOPOLICY_QD,
};

struct nvme_subsystem {
@@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
void nvme_mpath_start_request(struct request *rq);
void nvme_mpath_end_request(struct request *rq);
+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);

static inline void nvme_trace_bio_complete(struct request *req)
{
@@ -1034,6 +1040,9 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
{
return false;
}
+static inline void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+}
#endif /* CONFIG_NVME_MULTIPATH */

int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
--
2.39.3
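
As an aside, the selection heuristic is easy to experiment with outside the
kernel. Below is a minimal userspace sketch of the same least-busy-path idea
(illustrative only: the path names and counts are invented, and none of the
kernel's locking or RCU applies here):

	#include <stdio.h>
	#include <limits.h>

	struct path {
		const char *name;
		unsigned int nr_active;	/* in-flight requests on this path */
		int optimized;		/* ANA optimized vs. non-optimized */
	};

	/* mirror nvme_queue_depth_path(): prefer the optimized path with
	 * the fewest in-flight requests, fall back to non-optimized */
	static struct path *pick_path(struct path *p, int n)
	{
		struct path *best_opt = NULL, *best_nonopt = NULL;
		unsigned int min_opt = UINT_MAX, min_nonopt = UINT_MAX;
		int i;

		for (i = 0; i < n; i++) {
			if (p[i].optimized && p[i].nr_active < min_opt) {
				min_opt = p[i].nr_active;
				best_opt = &p[i];
			} else if (!p[i].optimized &&
				   p[i].nr_active < min_nonopt) {
				min_nonopt = p[i].nr_active;
				best_nonopt = &p[i];
			}
		}
		return best_opt ? best_opt : best_nonopt;
	}

	int main(void)
	{
		struct path paths[] = {
			{ "ctrl0", 12, 1 },
			{ "ctrl1",  3, 1 },
			{ "ctrl2",  0, 0 },
		};

		/* prints "selected ctrl1": the least-loaded optimized path
		 * wins even though ctrl2 is idle, because ctrl2 is
		 * non-optimized */
		printf("selected %s\n", pick_path(paths, 3)->name);
		return 0;
	}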


2024-05-20 20:50:14

by Keith Busch

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On Mon, May 20, 2024 at 04:20:45PM -0400, John Meneghini wrote:
> From: "Ewan D. Milne" <[email protected]>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between multiple active optimized paths. In the
> presence of one or more high latency paths the round-robin selector
> continues to use the high latency path equally. This results in a bias
> towards the highest latency path and can cause a significant decrease
> in overall performance as IOs pile on the highest latency path. This
> problem is particularly acute with NVMe-oF controllers.

The patch looks pretty good to me. Just a few questions/comments.

> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);

This seems odd. Why is this lock protecting both the global
nvme_subsystems list, and also subsystem controllers? IOW, why isn't the
subsys->ctrls list protected by the more fine grained 'subsys->lock'
instead of this global lock?
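
For reference, the finer-grained version would look something like this
(sketch only; it assumes the subsys->ctrls list could be covered by
subsys->lock, which is not the case today):

	mutex_lock(&subsys->lock);
	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
		atomic_set(&ctrl->nr_active, 0);
		nvme_mpath_clear_ctrl_paths(ctrl);
	}
	mutex_unlock(&subsys->lock);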

> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");

Unnecessary space before the ','.

> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> + atomic_inc(&ns->ctrl->nr_active);
> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> + }
> +
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
>
> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec_if_positive(&ns->ctrl->nr_active);

You can just do an atomic_dec() since your new flag ties this to
the atomic_inc().

> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> +
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }

Could we break out of this loop early if "min_depth_opt == 0"? We can't
find a better path than that, so no need to read the rest of the paths.
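
For illustration (sketch, untested):

	/* at the bottom of the list_for_each_entry_rcu() loop: */
	if (min_depth_opt == 0)	/* an idle optimized path cannot be beaten */
		break;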

> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> + struct nvme_ctrl *ctrl;
> + int old_iopolicy = READ_ONCE(subsys->iopolicy);
> +

Let's add a check here:

if (old_iopolicy == iopolicy)
return;

> @@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
> void nvme_mpath_start_request(struct request *rq);
> void nvme_mpath_end_request(struct request *rq);
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);

This function isn't used outside multipath.c, so it should be static.

2024-05-21 06:46:53

by Hannes Reinecke

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/20/24 22:20, John Meneghini wrote:
> From: "Ewan D. Milne" <[email protected]>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between multiple active optimized paths. In the
> presence of one or more high latency paths the round-robin selector
> continues to use the high latency path equally. This results in a bias
> towards the highest latency path and can cause a significant decrease
> in overall performance as IOs pile on the highest latency path. This
> problem is particularly acute with NVMe-oF controllers.
>
> The queue-depth policy instead sends I/O requests down the path with the
> fewest requests in its request queue. Paths with lower latency
> will clear requests more quickly and have fewer requests in their queues
> compared to higher latency paths. The goal of this path selector is to
> make more use of lower latency paths, which will bring down overall IO
> latency.
>
> Signed-off-by: Ewan D. Milne <[email protected]>
> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> and compilation warnings, updated MODULE_PARM description, and
> fixed potential issue with ->current_path[] being used]
> Signed-off-by: Thomas Song <[email protected]>
> [jmeneghi: various changes and improvements, addressed review comments]
> Signed-off-by: John Meneghini <[email protected]>
> Link: https://lore.kernel.org/linux-nvme/[email protected]/
> Tested-by: Marco Patalano <[email protected]>
> Reviewed-by: Randy Jennings <[email protected]>
> Tested-by: Jyoti Rani <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
> drivers/nvme/host/nvme.h | 9 ++++
> 3 files changed, 92 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a066429b790d..1dd7c52293ff 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>
> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);
>
> static DEFINE_IDA(nvme_instance_ida);
> static dev_t nvme_ctrl_base_chr_devt;
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 5397fb428b24..0e2b6e720e95 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> static const char *nvme_iopolicy_names[] = {
> [NVME_IOPOLICY_NUMA] = "numa",
> [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_NUMA;
> else if (!strncmp(val, "round-robin", 11))
> iopolicy = NVME_IOPOLICY_RR;
> + else if (!strncmp(val, "queue-depth", 11))
> + iopolicy = NVME_IOPOLICY_QD;
> else
> return -EINVAL;
>
> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>
> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> {
> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
> struct nvme_ns *ns = rq->q->queuedata;
> struct gendisk *disk = ns->head->disk;
>
> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> + atomic_inc(&ns->ctrl->nr_active);
> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> + }
> +
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
>
> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec_if_positive(&ns->ctrl->nr_active);
> +
> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;
> +

Pointless newline.

> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
> blk_rq_bytes(rq) >> SECTOR_SHIFT,
> nvme_req(rq)->start_time);
> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> return found;
> }
>
> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> +
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }
> + }
> +
> + return best_opt ? best_opt : best_nonopt;
> +}
> +
> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
> {
> return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
> @@ -338,15 +384,27 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>
> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> {
> - int node = numa_node_id();
> + int iopolicy = READ_ONCE(head->subsys->iopolicy);
> + int node;
> struct nvme_ns *ns;
>
> + /*
> + * queue-depth iopolicy does not need to reference ->current_path
> + * but round-robin needs the last path used to advance to the
> + * next one, and numa will continue to use the last path unless
> + * it is or has become not optimized
> + */
> + if (iopolicy == NVME_IOPOLICY_QD)
> + return nvme_queue_depth_path(head);
> +
> + node = numa_node_id();
> ns = srcu_dereference(head->current_path[node], &head->srcu);
> if (unlikely(!ns))
> return __nvme_find_path(head, node);
>
> - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
> + if (iopolicy == NVME_IOPOLICY_RR)
> return nvme_round_robin_path(head, node, ns);
> +
> if (unlikely(!nvme_path_is_optimized(ns)))
> return __nvme_find_path(head, node);
> return ns;
> @@ -798,6 +856,25 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
> nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
> }
>
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> + struct nvme_ctrl *ctrl;
> + int old_iopolicy = READ_ONCE(subsys->iopolicy);
> +
> + WRITE_ONCE(subsys->iopolicy, iopolicy);
> +
> + mutex_lock(&nvme_subsystems_lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + atomic_set(&ctrl->nr_active, 0);
> + nvme_mpath_clear_ctrl_paths(ctrl);

You always reset the variables here, even if the specified iopolicy is
the same as the currently active one.
I'd rather check that the iopolicy is different before changing the settings.
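
E.g. (sketch of the same guard Keith asked for earlier):

	if (old_iopolicy == iopolicy)
		return;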

> + }
> + mutex_unlock(&nvme_subsystems_lock);
> +
> + pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
> + nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
> + subsys->subnqn);
> +}
> +
> static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> {
> @@ -807,7 +884,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
>
> for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
> if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
> - WRITE_ONCE(subsys->iopolicy, i);
> + nvme_subsys_iopolicy_update(subsys, i);
> return count;
> }
> }
> @@ -905,6 +982,7 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
> mutex_init(&ctrl->ana_lock);
> timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
> INIT_WORK(&ctrl->ana_work, nvme_ana_work);
> + atomic_set(&ctrl->nr_active, 0);
> }
>
> int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index f243a5822c2b..f5557889b244 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
> extern struct workqueue_struct *nvme_reset_wq;
> extern struct workqueue_struct *nvme_delete_wq;
>
> +extern struct mutex nvme_subsystems_lock;
> +
> /*
> * List of workarounds for devices that required behavior not specified in
> * the standard.
> @@ -190,6 +192,7 @@ enum {
> NVME_REQ_CANCELLED = (1 << 0),
> NVME_REQ_USERCMD = (1 << 1),
> NVME_MPATH_IO_STATS = (1 << 2),
> + NVME_MPATH_CNT_ACTIVE = (1 << 3),
> };
>
> static inline struct nvme_request *nvme_req(struct request *req)
> @@ -354,6 +357,7 @@ struct nvme_ctrl {
> size_t ana_log_size;
> struct timer_list anatt_timer;
> struct work_struct ana_work;
> + atomic_t nr_active;
> #endif
>
> #ifdef CONFIG_NVME_HOST_AUTH
> @@ -402,6 +406,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
> enum nvme_iopolicy {
> NVME_IOPOLICY_NUMA,
> NVME_IOPOLICY_RR,
> + NVME_IOPOLICY_QD,
> };
>
> struct nvme_subsystem {
> @@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
> void nvme_mpath_start_request(struct request *rq);
> void nvme_mpath_end_request(struct request *rq);
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);
>
> static inline void nvme_trace_bio_complete(struct request *req)
> {
> @@ -1034,6 +1040,9 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
> {
> return false;
> }
> +static inline void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> +}
> #endif /* CONFIG_NVME_MULTIPATH */
>
> int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-05-21 08:48:59

by Nilay Shroff

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 5/21/24 01:50, John Meneghini wrote:
> From: "Ewan D. Milne" <[email protected]>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between multiple active optimized paths. In the
> presence of one or more high latency paths the round-robin selector
> continues to use the high latency path equally. This results in a bias
> towards the highest latency path and can cause a significant decrease
> in overall performance as IOs pile on the highest latency path. This
> problem is particularly acute with NVMe-oF controllers.
>
> The queue-depth policy instead sends I/O requests down the path with the
> fewest requests in its request queue. Paths with lower latency
> will clear requests more quickly and have fewer requests in their queues
> compared to higher latency paths. The goal of this path selector is to
> make more use of lower latency paths, which will bring down overall IO
> latency.
>
> Signed-off-by: Ewan D. Milne <[email protected]>
> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> and compilation warnings, updated MODULE_PARM description, and
> fixed potential issue with ->current_path[] being used]
> Signed-off-by: Thomas Song <[email protected]>
> [jmeneghi: various changes and improvements, addressed review comments]
> Signed-off-by: John Meneghini <[email protected]>
> Link: https://lore.kernel.org/linux-nvme/[email protected]/
> Tested-by: Marco Patalano <[email protected]>
> Reviewed-by: Randy Jennings <[email protected]>
> Tested-by: Jyoti Rani <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
> drivers/nvme/host/nvme.h | 9 ++++
> 3 files changed, 92 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a066429b790d..1dd7c52293ff 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>
> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);
>
> static DEFINE_IDA(nvme_instance_ida);
> static dev_t nvme_ctrl_base_chr_devt;
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 5397fb428b24..0e2b6e720e95 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> static const char *nvme_iopolicy_names[] = {
> [NVME_IOPOLICY_NUMA] = "numa",
> [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_NUMA;
> else if (!strncmp(val, "round-robin", 11))
> iopolicy = NVME_IOPOLICY_RR;
> + else if (!strncmp(val, "queue-depth", 11))
> + iopolicy = NVME_IOPOLICY_QD;
> else
> return -EINVAL;
>
> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>
> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> {
> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
> struct nvme_ns *ns = rq->q->queuedata;
> struct gendisk *disk = ns->head->disk;
>
> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> + atomic_inc(&ns->ctrl->nr_active);
> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> + }
> +
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
>
> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec_if_positive(&ns->ctrl->nr_active);
> +
> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;
> +
> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
> blk_rq_bytes(rq) >> SECTOR_SHIFT,
> nvme_req(rq)->start_time);
> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> return found;
> }
>
I think you may also want to adjust the nr_active counter in case an in-flight nvme request
is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
So you may want to decrement the nr_active counter from nvme_cancel_request() as below:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf7615cb36ee..4fea7883ce8e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);

bool nvme_cancel_request(struct request *req, void *data)
{
- dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
- "Cancelling I/O %d", req->tag);
+ struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
+
+ dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);

/* don't abort one completed or idle request */
if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
@@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)

nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+ if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
+ atomic_dec(&ctrl->nr_active);
blk_mq_complete_request(req);
return true;
}

Please note that I am using atomic_dec() instead of atomic_dec_if_positive()
above for the same reason as Keith mentioned in his earlier mail.

> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> +
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }
> + }
> +
> + return best_opt ? best_opt : best_nonopt;
> +}
> +
> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
> {
> return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
> @@ -338,15 +384,27 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>
> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> {
> - int node = numa_node_id();
> + int iopolicy = READ_ONCE(head->subsys->iopolicy);
> + int node;
> struct nvme_ns *ns;
>
> + /*
> + * queue-depth iopolicy does not need to reference ->current_path
> + * but round-robin needs the last path used to advance to the
> + * next one, and numa will continue to use the last path unless
> + * it is or has become not optimized
> + */
> + if (iopolicy == NVME_IOPOLICY_QD)
> + return nvme_queue_depth_path(head);
> +
> + node = numa_node_id();
> ns = srcu_dereference(head->current_path[node], &head->srcu);
> if (unlikely(!ns))
> return __nvme_find_path(head, node);
>
> - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
> + if (iopolicy == NVME_IOPOLICY_RR)
> return nvme_round_robin_path(head, node, ns);
> +
> if (unlikely(!nvme_path_is_optimized(ns)))
> return __nvme_find_path(head, node);
> return ns;
> @@ -798,6 +856,25 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
> nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
> }
>
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> + struct nvme_ctrl *ctrl;
> + int old_iopolicy = READ_ONCE(subsys->iopolicy);
> +
> + WRITE_ONCE(subsys->iopolicy, iopolicy);
> +
> + mutex_lock(&nvme_subsystems_lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + atomic_set(&ctrl->nr_active, 0);
> + nvme_mpath_clear_ctrl_paths(ctrl);
> + }
> + mutex_unlock(&nvme_subsystems_lock);
> +
> + pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
> + nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
> + subsys->subnqn);
> +}
> +
> static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> {
> @@ -807,7 +884,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
>
> for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
> if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
> - WRITE_ONCE(subsys->iopolicy, i);
> + nvme_subsys_iopolicy_update(subsys, i);
> return count;
> }
> }
> @@ -905,6 +982,7 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
> mutex_init(&ctrl->ana_lock);
> timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
> INIT_WORK(&ctrl->ana_work, nvme_ana_work);
> + atomic_set(&ctrl->nr_active, 0);
> }
>
> int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index f243a5822c2b..f5557889b244 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
> extern struct workqueue_struct *nvme_reset_wq;
> extern struct workqueue_struct *nvme_delete_wq;
>
> +extern struct mutex nvme_subsystems_lock;
> +
> /*
> * List of workarounds for devices that required behavior not specified in
> * the standard.
> @@ -190,6 +192,7 @@ enum {
> NVME_REQ_CANCELLED = (1 << 0),
> NVME_REQ_USERCMD = (1 << 1),
> NVME_MPATH_IO_STATS = (1 << 2),
> + NVME_MPATH_CNT_ACTIVE = (1 << 3),
> };
>
> static inline struct nvme_request *nvme_req(struct request *req)
> @@ -354,6 +357,7 @@ struct nvme_ctrl {
> size_t ana_log_size;
> struct timer_list anatt_timer;
> struct work_struct ana_work;
> + atomic_t nr_active;
> #endif
>
> #ifdef CONFIG_NVME_HOST_AUTH
> @@ -402,6 +406,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
> enum nvme_iopolicy {
> NVME_IOPOLICY_NUMA,
> NVME_IOPOLICY_RR,
> + NVME_IOPOLICY_QD,
> };
>
> struct nvme_subsystem {
> @@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
> void nvme_mpath_start_request(struct request *rq);
> void nvme_mpath_end_request(struct request *rq);
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);
>
> static inline void nvme_trace_bio_complete(struct request *req)
> {
> @@ -1034,6 +1040,9 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
> {
> return false;
> }
> +static inline void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> +}
> #endif /* CONFIG_NVME_MULTIPATH */
>
> int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,

Thanks,
--Nilay

2024-05-21 09:50:12

by Sagi Grimberg

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 21/05/2024 11:48, Nilay Shroff wrote:
>
> On 5/21/24 01:50, John Meneghini wrote:
>> From: "Ewan D. Milne" <[email protected]>
>>
>> The round-robin path selector is inefficient in cases where there is a
>> difference in latency between multiple active optimized paths. In the
>> presence of one or more high latency paths the round-robin selector
>> continues to use the high latency path equally. This results in a bias
>> towards the highest latency path and can cause a significant decrease
>> in overall performance as IOs pile on the highest latency path. This
>> problem is particularly acute with NVMe-oF controllers.
>>
>> The queue-depth policy instead sends I/O requests down the path with the
>> fewest requests in its request queue. Paths with lower latency
>> will clear requests more quickly and have fewer requests in their queues
>> compared to higher latency paths. The goal of this path selector is to
>> make more use of lower latency paths, which will bring down overall IO
>> latency.
>>
>> Signed-off-by: Ewan D. Milne <[email protected]>
>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>> and compilation warnings, updated MODULE_PARM description, and
>> fixed potential issue with ->current_path[] being used]
>> Signed-off-by: Thomas Song <[email protected]>
>> [jmeneghi: various changes and improvements, addressed review comments]
>> Signed-off-by: John Meneghini <[email protected]>
>> Link: https://lore.kernel.org/linux-nvme/[email protected]/
>> Tested-by: Marco Patalano <[email protected]>
>> Reviewed-by: Randy Jennings <[email protected]>
>> Tested-by: Jyoti Rani <[email protected]>
>> ---
>> drivers/nvme/host/core.c | 2 +-
>> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>> drivers/nvme/host/nvme.h | 9 ++++
>> 3 files changed, 92 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index a066429b790d..1dd7c52293ff 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>
>> static LIST_HEAD(nvme_subsystems);
>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>
>> static DEFINE_IDA(nvme_instance_ida);
>> static dev_t nvme_ctrl_base_chr_devt;
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 5397fb428b24..0e2b6e720e95 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>> static const char *nvme_iopolicy_names[] = {
>> [NVME_IOPOLICY_NUMA] = "numa",
>> [NVME_IOPOLICY_RR] = "round-robin",
>> + [NVME_IOPOLICY_QD] = "queue-depth",
>> };
>>
>> static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>> iopolicy = NVME_IOPOLICY_NUMA;
>> else if (!strncmp(val, "round-robin", 11))
>> iopolicy = NVME_IOPOLICY_RR;
>> + else if (!strncmp(val, "queue-depth", 11))
>> + iopolicy = NVME_IOPOLICY_QD;
>> else
>> return -EINVAL;
>>
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>> &iopolicy, 0644);
>> MODULE_PARM_DESC(iopolicy,
>> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>>
>> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>> {
>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>> struct nvme_ns *ns = rq->q->queuedata;
>> struct gendisk *disk = ns->head->disk;
>>
>> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>> + atomic_inc(&ns->ctrl->nr_active);
>> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>> + }
>> +
>> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>> return;
>>
>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>> {
>> struct nvme_ns *ns = rq->q->queuedata;
>>
>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>> + atomic_dec_if_positive(&ns->ctrl->nr_active);
>> +
>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>> return;
>> +
>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>> blk_rq_bytes(rq) >> SECTOR_SHIFT,
>> nvme_req(rq)->start_time);
>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>> return found;
>> }
>>
> I think you may also want to adjust the nr_active counter in case an in-flight nvme request
> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
> So you may want to decrement the nr_active counter from nvme_cancel_request() as below:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bf7615cb36ee..4fea7883ce8e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>
> bool nvme_cancel_request(struct request *req, void *data)
> {
> - dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> - "Cancelling I/O %d", req->tag);
> + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> + dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>
> /* don't abort one completed or idle request */
> if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>
> nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec(&ctrl->nr_active);

Don't think this matters because cancellation only happens when we
teardown the controller anyways...

Btw, can we have a better name than nr_active? This is just for IO and only
for multipath. Maybe ctrl->nr_mpath_io_active?

Also perhaps rename the flag to NVME_MPATH_CTRL_IO_ACCOUNTING?

2024-05-21 10:07:47

by Nilay Shroff

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 5/21/24 15:15, Sagi Grimberg wrote:
>
>
> On 21/05/2024 11:48, Nilay Shroff wrote:
>>
>> On 5/21/24 01:50, John Meneghini wrote:
>>> From: "Ewan D. Milne" <[email protected]>
>>>
>>> The round-robin path selector is inefficient in cases where there is a
>>> difference in latency between multiple active optimized paths.  In the
>>> presence of one or more high latency paths the round-robin selector
>>> continues to use the high latency path equally. This results in a bias
>>> towards the highest latency path and can cause a significant decrease
>>> in overall performance as IOs pile on the highest latency path. This
>>> problem is particularly acute with NVMe-oF controllers.
>>>
>>> The queue-depth policy instead sends I/O requests down the path with the
>>> fewest requests in its request queue. Paths with lower latency
>>> will clear requests more quickly and have fewer requests in their queues
>>> compared to higher latency paths. The goal of this path selector is to
>>> make more use of lower latency paths, which will bring down overall IO
>>> latency.
>>>
>>> Signed-off-by: Ewan D. Milne <[email protected]>
>>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>>        and compilation warnings, updated MODULE_PARM description, and
>>>        fixed potential issue with ->current_path[] being used]
>>> Signed-off-by: Thomas Song <[email protected]>
>>> [jmeneghi: various changes and improvements, addressed review comments]
>>> Signed-off-by: John Meneghini <[email protected]>
>>> Link: https://lore.kernel.org/linux-nvme/[email protected]/
>>> Tested-by: Marco Patalano <[email protected]>
>>> Reviewed-by: Randy Jennings <[email protected]>
>>> Tested-by: Jyoti Rani <[email protected]>
>>> ---
>>>   drivers/nvme/host/core.c      |  2 +-
>>>   drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>>>   drivers/nvme/host/nvme.h      |  9 ++++
>>>   3 files changed, 92 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index a066429b790d..1dd7c52293ff 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>>>   EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>>     static LIST_HEAD(nvme_subsystems);
>>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>>     static DEFINE_IDA(nvme_instance_ida);
>>>   static dev_t nvme_ctrl_base_chr_devt;
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 5397fb428b24..0e2b6e720e95 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>>   static const char *nvme_iopolicy_names[] = {
>>>       [NVME_IOPOLICY_NUMA]    = "numa",
>>>       [NVME_IOPOLICY_RR]    = "round-robin",
>>> +    [NVME_IOPOLICY_QD]      = "queue-depth",
>>>   };
>>>     static int iopolicy = NVME_IOPOLICY_NUMA;
>>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>>           iopolicy = NVME_IOPOLICY_NUMA;
>>>       else if (!strncmp(val, "round-robin", 11))
>>>           iopolicy = NVME_IOPOLICY_RR;
>>> +    else if (!strncmp(val, "queue-depth", 11))
>>> +        iopolicy = NVME_IOPOLICY_QD;
>>>       else
>>>           return -EINVAL;
>>>   @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>>>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>>       &iopolicy, 0644);
>>>   MODULE_PARM_DESC(iopolicy,
>>> -    "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>>> +    "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>>>     void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>>   {
>>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>>>       struct nvme_ns *ns = rq->q->queuedata;
>>>       struct gendisk *disk = ns->head->disk;
>>>   +    if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>>> +        atomic_inc(&ns->ctrl->nr_active);
>>> +        nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>>> +    }
>>> +
>>>       if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>>>           return;
>>>   @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>>>   {
>>>       struct nvme_ns *ns = rq->q->queuedata;
>>>   +    if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>>> +        atomic_dec_if_positive(&ns->ctrl->nr_active);
>>> +
>>>       if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>>           return;
>>> +
>>>       bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>>                blk_rq_bytes(rq) >> SECTOR_SHIFT,
>>>                nvme_req(rq)->start_time);
>>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>>>       return found;
>>>   }
>>>  
>> I think you may also want to adjust the nr_active counter in case an in-flight nvme request
>> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
>> So you may want to decrement the nr_active counter from nvme_cancel_request() as below:
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index bf7615cb36ee..4fea7883ce8e 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>>     bool nvme_cancel_request(struct request *req, void *data)
>>   {
>> -       dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
>> -                               "Cancelling I/O %d", req->tag);
>> +       struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
>> +
>> +       dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>>            /* don't abort one completed or idle request */
>>          if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
>> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>>            nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
>>          nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>> +       if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
>> +               atomic_dec(&ctrl->nr_active);
>
> Don't think this matters because cancellation only happens when we
> teardown the controller anyways...
>
I think if we reset the nvme controller then we don't tear down the
controller, do we? In this case we cancel all pending requests, and
later restart the controller.

Thanks,
--Nilay


2024-05-21 10:11:30

by Sagi Grimberg

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"


>> Don't think this matters because cancellation only happens when we
>> teardown the controller anyways...
>>
> I think if we reset the nvme controller then we don't tear down the
> controller, do we? In this case we cancel all pending requests, and
> later restart the controller.

Exactly, nvme_mpath_init_ctrl resets the counter.

2024-05-21 10:23:13

by Sagi Grimberg

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 21/05/2024 13:11, Sagi Grimberg wrote:
>
>>> Don't think this matters because cancellation only happens when we
>>> teardown the controller anyways...
>>>
>> I think if we reset the nvme controller then we don't tear down the
>> controller, do we? In this case we cancel all pending requests, and
>> later restart the controller.
>
> Exactly, nvme_mpath_init_ctrl resets the counter.

Except you're right, the counter reset needs to move to
nvme_mpath_init_identify()
or some place that is called on every controller reset.
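
Something along these lines (sketch, untested; assumes
nvme_mpath_init_identify() is reached on every reset/reconnect):

	int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
	{
		/* every (re)connect starts with no in-flight multipath I/O */
		atomic_set(&ctrl->nr_active, 0);
		...
	}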

2024-05-21 10:24:10

by Sagi Grimberg

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 21/05/2024 13:15, Sagi Grimberg wrote:
>
>
> On 21/05/2024 13:11, Sagi Grimberg wrote:
>>
>>>> Don't think this matters because cancellation only happens when we
>>>> teardown the controller anyways...
>>>>
>>> I think if we reset the nvme controller then we don't tear down the
>>> controller, do we? In this case we cancel all pending requests, and
>>> later restart the controller.
>>
>> Exactly, nvme_mpath_init_ctrl resets the counter.
>
> Except you're right, the counter reset needs to move to
> nvme_mpath_init_identify()
> or some place that is called on every controller reset.

This however raises the question of how much failover/reset testing this
patch has seen...

2024-05-21 10:28:14

by Nilay Shroff

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 5/21/24 15:45, Sagi Grimberg wrote:
>
>
> On 21/05/2024 13:11, Sagi Grimberg wrote:
>>
>>>> Don't think this matters because cancellation only happens when we
>>>> teardown the controller anyways...
>>>>
>>> I think if we reset the nvme controller then we don't tear down the
>>> controller, do we? In this case we cancel all pending requests, and
>>> later restart the controller.
>>
>> Exactly, nvme_mpath_init_ctrl resets the counter.
>
> Except you're right, the counter reset needs to move to nvme_mpath_init_identify()
> or some place that is called on every controller reset.
Yeah, that was my point. Unfortunately, nvme_mpath_init_ctrl() is not called when we do an nvme
controller reset. So either we move nvme_mpath_init_ctrl() to some common place which
is called from the regular controller init code as well as the controller reset code path,
or we may explicitly reset the nr_active counter from nvme_timeout().

Thanks,
--Nilay

2024-05-21 13:06:14

by Keith Busch

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On Tue, May 21, 2024 at 02:18:09PM +0530, Nilay Shroff wrote:
> On 5/21/24 01:50, John Meneghini wrote:
> > @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
> > {
> > struct nvme_ns *ns = rq->q->queuedata;
> >
> > + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> > + atomic_dec_if_positive(&ns->ctrl->nr_active);
> > +
> > if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> > return;
> > +
> > bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
> > blk_rq_bytes(rq) >> SECTOR_SHIFT,
> > nvme_req(rq)->start_time);
> > @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> > return found;
> > }
> >
> I think you may also want to adjust the nr_active counter in case an in-flight nvme request
> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
> So you may want to decrement the nr_active counter from nvme_cancel_request() as below:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bf7615cb36ee..4fea7883ce8e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>
> bool nvme_cancel_request(struct request *req, void *data)
> {
> - dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> - "Cancelling I/O %d", req->tag);
> + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> + dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>
> /* don't abort one completed or idle request */
> if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>
> nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec(&ctrl->nr_active);
> blk_mq_complete_request(req);
> return true;
> }

The io stats wouldn't be right if that happened. And maybe it isn't
right on a failover, but it needs to be. Would it work if
nvme_failover_req() calls nvme_end_req() instead of directly calling
blk_mq_end_request()?

2024-05-21 13:58:59

by John Meneghini

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/21/24 02:46, Hannes Reinecke wrote:
> On 5/20/24 22:20, John Meneghini wrote:
>> From: "Ewan D. Milne" <[email protected]>
>>
..
>> Tested-by: Marco Patalano <[email protected]>
>> Reviewed-by: Randy Jennings <[email protected]>

I need to fix this. Randy doesn't have a redhat.com email address... Cut and paste error :-(

>> Tested-by: Jyoti Rani <[email protected]>
>>
..
>> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
>> +{
>> +    struct nvme_ctrl *ctrl;
>> +    int old_iopolicy = READ_ONCE(subsys->iopolicy);
>> +
>> +    WRITE_ONCE(subsys->iopolicy, iopolicy);
>> +
>> +    mutex_lock(&nvme_subsystems_lock);
>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>> +        atomic_set(&ctrl->nr_active, 0);
>> +        nvme_mpath_clear_ctrl_paths(ctrl);
>
> You always reset the variables here, even if the specified iopolicy is
> the same as the currently active one.
> I'd rather check that the iopolicy is different before changing the settings.

Yes, Keith pointed this out too. This is actually a feature, not a bug. In situations where we want to "reset" the nr_active
counters on all controllers the user can simply set the queue-depth iopolicy a second time. I don't expect users to do this
very often... they shouldn't be changing IO policies back and forth too much... but the ability to "reset" the nr_active
counters during testing has been very helpful and important to do. So I'd like to keep this. Moreover, this is NOT the
performance path. I don't see the point in making performance optimizations in a code path that is run once a year.

/John


2024-05-21 14:10:52

by Keith Busch

Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On Tue, May 21, 2024 at 09:58:31AM -0400, John Meneghini wrote:
> On 5/21/24 02:46, Hannes Reinecke wrote:
> > > + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > > + atomic_set(&ctrl->nr_active, 0);
> > > + nvme_mpath_clear_ctrl_paths(ctrl);
> >
> > You always reset the variables here, even if the specified iopolicy is
> > the same as the currently active one.
> > I'd rather check that the iopolicy is different before changing the settings.
>
> Yes, Keith pointed this out too. This is actually a feature, not a bug. In
> situations where we want to "reset" the nr_active counters on all controllers
> the user can simply set the queue-depth iopolicy a second time. I don't
> expect users to do this very often... they shouldn't be changing IO policies
> back and forth too much... but the ability to "reset" the nr_active counters
> during testing has been very helpful and important to do. So I'd like to
> keep this. Moreover, this is NOT the performance path. I don't see the
> point in making performance optimizations in a code path that is run once a
> year.

I missed that you actually want to reset the counters on a live queue.
Wouldn't that just lead to an imbalance? If that is really a feature,
then I retract a previous comment: you do need the atomic_dec_not_zero
(or whatever it was called) since the active count is no longer tied to
the inc's.

2024-05-21 14:22:01

by John Meneghini

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/20/24 16:50, Keith Busch wrote:
>> static LIST_HEAD(nvme_subsystems);
>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>> +DEFINE_MUTEX(nvme_subsystems_lock);
> This seems odd. Why is this lock protecting both the global
> nvme_subsystems list, and also subsystem controllers? IOW, why isn't the
> subsys->ctrls list protected by the more fine grained 'subsys->lock'
> instead of this global lock?
>
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>> &iopolicy, 0644);
>> MODULE_PARM_DESC(iopolicy,
>> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>
> Unnecessary space before the ','.

I'll fix this.
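i.e. the line will read:

	MODULE_PARM_DESC(iopolicy,
		"Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");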

>> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>> + atomic_inc(&ns->ctrl->nr_active);
>> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>> + }
>> +
>> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>> return;
>>
>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>> {
>> struct nvme_ns *ns = rq->q->queuedata;
>>
>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>> + atomic_dec_if_positive(&ns->ctrl->nr_active);
>
> You can just do a atomic_dec() since your new flag has this tied to to
> the atomic_inc().

No, I don't think that would be correct because, for a number of reasons, this counter could go below zero. For example, there is
nothing to prevent the IO policy from changing between nvme_mpath_start_request and nvme_mpath_end_request, and there are code
paths, like the reset/cancel code path, which could affect this counter.

Also, this code path has been extensively tested. Ewan has played with all kinds of different counting schemes, which didn't
work. I'm happy to make non-functional changes, but a functional change like this would require completely retesting and
re-verifying things.

I'd like to keep this.
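To spell out one such interleaving (an illustrative timeline, not real code):

	/* CPU A (I/O path)                  CPU B (sysfs)
	 *
	 * nvme_mpath_start_request()
	 *   atomic_inc(&ctrl->nr_active)
	 *                                   nvme_subsys_iopolicy_update()
	 *                                     atomic_set(&ctrl->nr_active, 0)
	 * nvme_mpath_end_request()
	 *   a plain atomic_dec() here would
	 *   drive nr_active below zero;
	 *   atomic_dec_if_positive() clamps
	 *   it at zero.
	 */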

>> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>> +{
>> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
>> + unsigned int depth;
>> +
>> + list_for_each_entry_rcu(ns, &head->list, siblings) {
>> + if (nvme_path_is_disabled(ns))
>> + continue;
>> +
>> + depth = atomic_read(&ns->ctrl->nr_active);
>> +
>> + switch (ns->ana_state) {
>> + case NVME_ANA_OPTIMIZED:
>> + if (depth < min_depth_opt) {
>> + min_depth_opt = depth;
>> + best_opt = ns;
>> + }
>> + break;
>> +
>> + case NVME_ANA_NONOPTIMIZED:
>> + if (depth < min_depth_nonopt) {
>> + min_depth_nonopt = depth;
>> + best_nonopt = ns;
>> + }
>> + break;
>> + default:
>> + break;
>> + }
>
> Could we break out of this loop early if "min_depth_opt == 0"? We can't
> find a better path than that, so no need to read the rest of the paths.

I can do that... since we are keeping the atomic_dec_if_positive() above. ;-)
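A sketch of what I have in mind, placed after the switch inside the loop:

		/* nothing can beat an idle optimized path; stop scanning */
		if (min_depth_opt == 0)
			break;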

>> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
>> +{
>> + struct nvme_ctrl *ctrl;
>> + int old_iopolicy = READ_ONCE(subsys->iopolicy);
>> +
> Let's add a check here:
>
> if (old_iopolicy == iopolicy)
> return;

Actually, this is a feature, not a bug. I'd like to keep the ability to reset the nr_active counters. It is an invaluable tool
that I use during testing. See my comments to Hannes.

>> @@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
>> void nvme_mpath_start_request(struct request *rq);
>> void nvme_mpath_end_request(struct request *rq);
>> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);
>
> This function isn't used outside multipath.c, so it should be static.

I'll fix this.
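i.e. drop the prototype from nvme.h and declare it in multipath.c as:

	static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
			int iopolicy)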

/John


2024-05-21 14:24:04

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/21/24 15:58, John Meneghini wrote:
> On 5/21/24 02:46, Hannes Reinecke wrote:
>> On 5/20/24 22:20, John Meneghini wrote:
>>> From: "Ewan D. Milne" <[email protected]>
>>>
> ...
>>> Tested-by: Marco Patalano <[email protected]>
>>> Reviewed-by: Randy Jennings <[email protected]>
>
> I need to fix this. Randy doesn't have a redhat.com email address... Cut
> and paste error :-(
>
>>> Tested-by: Jyoti Rani <[email protected]>
>>>
> ...
>>> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int
>>> iopolicy)
>>> +{
>>> +    struct nvme_ctrl *ctrl;
>>> +    int old_iopolicy = READ_ONCE(subsys->iopolicy);
>>> +
>>> +    WRITE_ONCE(subsys->iopolicy, iopolicy);
>>> +
>>> +    mutex_lock(&nvme_subsystems_lock);
>>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>> +        atomic_set(&ctrl->nr_active, 0);
>>> +        nvme_mpath_clear_ctrl_paths(ctrl);
>>
>> You always reset the variables here, even if the specified iopolicy is
>> the same as the currently active one.
>> I'd rather check if the iopolicy is different before changing the
>> settings.
>
> Yes, Keith pointed this out too.  This is actually a feature, not a bug.
> In situations where we want to "reset" the nr_active counters on all
> controllers, the user can simply set the queue-depth iopolicy a second
> time.  I don't expect users to do this very often... they shouldn't be
> changing IO policies back and forth too much... but the ability to
> "reset" the nr_active counters during testing has been very helpful and
> important.  So I'd like to keep this.  Moreover, this is NOT the
> performance path. I don't see the point in making performance
> optimizations in a code path that is run once a year.
>
Please add a comment indicating that.
Just to make it clear that it's by design.
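Something along these lines, perhaps:

	/*
	 * Writing the same iopolicy again is deliberately not a no-op:
	 * it resets each controller's nr_active counter, which is a
	 * useful knob during testing. Do not short-circuit on
	 * old_iopolicy == iopolicy.
	 */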

Cheers,

Hannes

2024-05-21 14:59:29

by John Meneghini

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/21/24 06:16, Sagi Grimberg wrote:
>>>
>>> Exactly, nvme_mpath_init_ctrl resets the counter.
>>
>> Except you're right, the counter reset needs to move to nvme_mpath_init_identify()
>> or some place that is called on every controller reset.
>
> This however raises the question of how much failover/reset tests this patch has seen...

It has received quite a bit of testing with failover and controller resets. I shared some of the testing that was done at LSFMM
last week.

It has received enough testing to make me confident that this code is safe. That is: it won't panic, corrupt data, or otherwise
do any harm. We believe the error paths will not be affected by this change... but I agree that running the error paths could
negatively impact the accuracy of the nr_active counters... which could lead to an inaccurate outcome with the queue-depth policy.

I agree the nr_active counter initialization should move to nvme_mpath_init_identify(), or maybe be done there in addition to
nvme_mpath_init_ctrl(). I'm willing to make that change now... if that's what people want. I don't think it would require any
extensive retesting.

/John


2024-05-21 16:36:22

by Caleb Sander

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On Tue, May 21, 2024 at 8:03 AM John Meneghini <[email protected]> wrote:
>
> On 5/21/24 02:46, Hannes Reinecke wrote:
> > On 5/20/24 22:20, John Meneghini wrote:
> >> From: "Ewan D. Milne" <[email protected]>
> >>
> ...
> >> Tested-by: Marco Patalano <[email protected]>
> >> Reviewed-by: Randy Jennings <[email protected]>
>
> I need to fix this. Randy doesn't have a redhat.com email address... Cut an paste error :-(
>
> >> Tested-by: Jyoti Rani <[email protected]>
> >>

Should be "jrani" instead of "jani" here also.

Thanks,
Caleb

2024-05-22 10:48:36

by Nilay Shroff

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 5/21/24 20:14, John Meneghini wrote:
> On 5/21/24 06:16, Sagi Grimberg wrote:
>>>>
>>>> Exactly, nvme_mpath_init_ctrl resets the counter.
>>>
>>> Except you're right, the counter reset needs to move to nvme_mpath_init_identify()
>>> or some place that is called on every controller reset.
>>
>> This however raises the question of how much failover/reset tests this patch has seen...
>
> It has received quite a bit of testing with failover and controller resets.  I shared some of the testing that was done at LSFMM last week.
>
> It has received enough testing to make me confident that this code is safe.  That is: it won't panic, corrupt data, or otherwise do any harm.  We believe the error paths will not be affected by this change... but I agree that running the error paths could negatively impact the accuracy of the nr_active counters... which could lead to an inaccurate outcome with the queue-depth policy.
>
> I agree the nr_active counter initialization should move to nvme_mpath_init_identify(), or maybe be done there in addition to nvme_mpath_init_ctrl(). I'm willing to make that change now... if that's what people want.  I don't think it would require any extensive retesting.
>
> /John
>
>
I think with Keith's recently proposed patch fixing io accounting on failover,
nvme_mpath_end_request() would be called even for cancelled IOs, and so the nr_active
counter will be adjusted correctly for cancelled IO requests. Having said that, IMO
you should consider moving the initialization of the nr_active counter to nvme_mpath_init_identify(),
as that's a common function invoked from the regular controller initialization code path as well as
the reset code path.
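For example (a sketch only; the rest of the function and its error handling
are elided):

	int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
	{
		/* Reset queue-depth accounting here: unlike
		 * nvme_mpath_init_ctrl(), this function runs on every
		 * controller reset as well as at initial connect.
		 */
		atomic_set(&ctrl->nr_active, 0);
		...
	}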

Thanks,
--Nilay

2024-05-22 10:54:53

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"



On 22/05/2024 13:48, Nilay Shroff wrote:
>
> On 5/21/24 20:14, John Meneghini wrote:
>> On 5/21/24 06:16, Sagi Grimberg wrote:
>>>>> Exactly, nvme_mpath_init_ctrl resets the counter.
>>>> Except you're right, the counter reset needs to move to nvme_mpath_init_identify()
>>>> or some place that is called on every controller reset.
>>> This however raises the question of how much failover/reset tests this patch has seen...
>> It has received quite a bit of testing with failover and controller resets.  I shared some of the testing that was done at LSFMM last week.
>>
>> It has received enough testing to make me confident that this code is safe.  That is: it won't panic, corrupt data, or otherwise do any harm.  We believe the error paths will not be affected by this change... but I agree that running the error paths could negatively impact the accuracy of the nr_active counters... which could lead to an inaccurate outcome with the queue-depth policy.
>>
>> I agree the nr_active counter initialization should move to nvme_mpath_init_identify(), or maybe be done there in addition to nvme_mpath_init_ctrl(). I'm willing to make that change now... if that's what people want.  I don't think it would require any extensive retesting.
>>
>> /John
>>
>>
> I think with Keith's recently proposed patch fixing io accounting on failover,
> nvme_mpath_end_request() would be called even for cancelled IOs, and so the nr_active
> counter will be adjusted correctly for cancelled IO requests. Having said that, IMO
> you should consider moving the initialization of the nr_active counter to nvme_mpath_init_identify(),
> as that's a common function invoked from the regular controller initialization code path as well as
> the reset code path.

Yes, and preferably with a comment explaining why it's there (despite
having nothing to do with identify...)

2024-05-22 13:12:45

by John Meneghini

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

On 5/22/24 06:48, Nilay Shroff wrote:
>> It has received quite a bit of testing with failover and controller resets.  I shared some of the testing that was done at LSFMM last week.
>>
>> It has received enough testing to make me confident that this code is safe.  That is: it won't panic, corrupt data, or otherwise do any harm.  We believe the error paths will not be affected by this change... but I agree that running the error paths could negatively impact the accuracy of the nr_active counters... which could lead to an inaccurate outcome with the queue-depth policy.
>>
>> I agree the nr_active counter initialization should move to nvme_mpath_init_identify(), or maybe be done there in addition to nvme_mpath_init_ctrl(). I'm willing to make that change now... if that's what people want.  I don't think it would require any extensive retesting.
>>
>> /John
>>
>>
> I think with Keith's recently proposed patch fixing io accounting on failover,
> nvme_mpath_end_request() would be called even for cancelled IOs, and so the nr_active
> counter will be adjusted correctly for cancelled IO requests. Having said that, IMO
> you should consider moving the initialization of the nr_active counter to nvme_mpath_init_identify(),
> as that's a common function invoked from the regular controller initialization code path as well as
> the reset code path.

Agreed. I'll do that in V4 of the patch set.

/John