2024-06-12 00:20:59

by John Meneghini

Subject: [PATCH v6 0/1] nvme: queue-depth multipath iopolicy

I've rebased this patch onto nvme-6.11, addressed all review comments,
and retested everything.

The new test results can be seen at:

https://github.com/johnmeneghini/iopolicy/tree/sample3

Changes since V5:

Refactored nvme_find_path() to reduce the spaghetti code.
Cleaned up all comments and reduced the total size of the
diff, and fixed the commit message. Thomas Song now
gets credit as the first author.

Changes since V4:

Removed atomic_set() from nvme_subsys_iopolicy_update() and added an early
return if (old_iopolicy == iopolicy) at the beginning of that function.

Changes since V3:

Addressed all review comments, fixed the commit log, and moved
nr_counter initialization from nvme_mpath_init_ctlr() to
nvme_mpath_init_identify().

Changes since V2:

Add the NVME_MPATH_CNT_ACTIVE flag to eliminate a READ_ONCE in the
completion path and increment/decrement the active_nr count on all mpath
IOs - including passthru commands.

Send a pr_notice whenever the iopolicy on a subsystem is changed. This
is important for support reasons. It is fully expected that users will
be changing the iopolicy with active IO in progress.

Squashed everything and rebased to nvme-v6.10
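
The per-request flag described in the V2 notes above can be illustrated with a small userspace sketch. This is not the kernel code: the names `start_request`, `end_request`, `REQ_CNT_ACTIVE` and `subsys_iopolicy` are hypothetical stand-ins, and C11 atomics are assumed in place of the kernel's atomic_t. The point it tries to show is that the completion side consults only the flag stored on the request, so the counter stays balanced even if the policy changes while IO is in flight.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel's iopolicy values and request flag. */
enum iopolicy { POLICY_NUMA, POLICY_RR, POLICY_QD };
#define REQ_CNT_ACTIVE (1u << 3)

struct ctrl {
	atomic_int nr_active;	/* requests currently counted on this controller */
};

struct req {
	struct ctrl *ctrl;
	unsigned int flags;
};

/* Assumed global for the sketch; the patch keeps the policy per subsystem. */
static atomic_int subsys_iopolicy = POLICY_NUMA;

static void start_request(struct req *rq)
{
	/* Count the request only when queue-depth is the active policy,
	 * and remember on the request itself that it was counted. */
	if (atomic_load(&subsys_iopolicy) == POLICY_QD) {
		atomic_fetch_add(&rq->ctrl->nr_active, 1);
		rq->flags |= REQ_CNT_ACTIVE;
	}
}

static void end_request(struct req *rq)
{
	/* No policy read here: the flag alone decides whether to decrement.
	 * The patch uses atomic_dec_if_positive(); a plain decrement is
	 * used in this sketch for brevity. */
	if (rq->flags & REQ_CNT_ACTIVE)
		atomic_fetch_sub(&rq->ctrl->nr_active, 1);
}
```

A request started under queue-depth is still decremented after a switch to round-robin, and a request started under round-robin is never decremented, so the count returns to zero in both cases.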

Changes since V1:

I'm re-issuing Ewan's queue-depth patches in preparation for LSFMM.

These patches were first shown at ALPSS 2023 where I shared the following
graphs which measure the IO distribution across 4 active-optimized
controllers using the round-robin versus queue-depth iopolicy.

https://people.redhat.com/jmeneghi/ALPSS_2023/NVMe_QD_Multipathing.pdf

Since that time we have continued testing these patches with a number of
different nvme-of storage arrays and test bed configurations, and I've
codified the tests and methods we use to measure IO distribution.

All of my test results, together with the scripts I used to generate these
graphs, are available at:

https://github.com/johnmeneghini/iopolicy

Please use the scripts in this repository to do your own testing.

These patches are based on nvme-v6.9

Thomas Song (1):
nvme-multipath: implement "queue-depth" iopolicy

drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 108 +++++++++++++++++++++++++++++++---
drivers/nvme/host/nvme.h | 5 ++
3 files changed, 106 insertions(+), 9 deletions(-)

--
2.39.3



2024-06-12 00:21:07

by John Meneghini

Subject: [PATCH v6 1/1] nvme-multipath: implement "queue-depth" iopolicy

From: Thomas Song <[email protected]>

The round-robin path selector is inefficient in cases where there is a
difference in latency between paths. In the presence of one or more
high latency paths the round-robin selector continues to use the high
latency path equally. This results in a bias towards the highest latency
path and can cause a significant decrease in overall performance as IOs
pile on the highest latency path. This problem is acute with NVMe-oF
controllers.

The queue-depth policy instead sends I/O requests down the path with the
least amount of requests in its request queue. Paths with lower latency
will clear requests more quickly and have less requests in their queues
compared to higher latency paths. The goal of this path selector is to
make more use of lower latency paths which will bring down overall IO
latency and increase throughput and performance.

Signed-off-by: Thomas Song <[email protected]>
[emilne: patch developed by Thomas Song @ Pure Storage, fixed whitespace
and compilation warnings, updated MODULE_PARM description, and
fixed potential issue with ->current_path[] being used]
Co-developed-by: Ewan D. Milne <[email protected]>
Signed-off-by: Ewan D. Milne <[email protected]>
[jmeneghi: various changes and improvements, addressed review comments]
Co-developed-by: John Meneghini <[email protected]>
Signed-off-by: John Meneghini <[email protected]>
Link: https://lore.kernel.org/linux-nvme/[email protected]/
Tested-by: Marco Patalano <[email protected]>
Reviewed-by: Randy Jennings <[email protected]>
Tested-by: Jyoti Rani <[email protected]>
---
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 108 +++++++++++++++++++++++++++++++---
drivers/nvme/host/nvme.h | 5 ++
3 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7c9f91314d36..c10ff8815d82 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
EXPORT_SYMBOL_GPL(nvme_delete_wq);

static LIST_HEAD(nvme_subsystems);
-static DEFINE_MUTEX(nvme_subsystems_lock);
+DEFINE_MUTEX(nvme_subsystems_lock);

static DEFINE_IDA(nvme_instance_ida);
static dev_t nvme_ctrl_base_chr_devt;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 03a6868f4dbc..fe10e0cebcf0 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
};

static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_NUMA;
else if (!strncmp(val, "round-robin", 11))
iopolicy = NVME_IOPOLICY_RR;
+ else if (!strncmp(val, "queue-depth", 11))
+ iopolicy = NVME_IOPOLICY_QD;
else
return -EINVAL;

@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
&iopolicy, 0644);
MODULE_PARM_DESC(iopolicy,
- "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+ "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");

void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
{
@@ -128,6 +131,11 @@ void nvme_mpath_start_request(struct request *rq)
struct nvme_ns *ns = rq->q->queuedata;
struct gendisk *disk = ns->head->disk;

+ if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
+ atomic_inc(&ns->ctrl->nr_active);
+ nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
+ }
+
if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
return;

@@ -140,6 +148,12 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
void nvme_mpath_end_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;
+ int result;
+
+ if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)) {
+ result = atomic_dec_if_positive(&ns->ctrl->nr_active);
+ WARN_ON_ONCE(result < 0);
+ }

if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
@@ -291,10 +305,15 @@ static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
}

-static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
- int node, struct nvme_ns *old)
+static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
{
- struct nvme_ns *ns, *found = NULL;
+ struct nvme_ns *ns, *old, *found = NULL;
+ int node = numa_node_id();
+
+ old = srcu_dereference(head->current_path[node], &head->srcu);
+
+ if (unlikely(!old))
+ return __nvme_find_path(head, node);

if (list_is_singular(&head->list)) {
if (nvme_path_is_disabled(old))
@@ -334,13 +353,49 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
return found;
}

+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+ unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+ unsigned int depth;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ if (nvme_path_is_disabled(ns))
+ continue;
+
+ depth = atomic_read(&ns->ctrl->nr_active);
+
+ switch (ns->ana_state) {
+ case NVME_ANA_OPTIMIZED:
+ if (depth < min_depth_opt) {
+ min_depth_opt = depth;
+ best_opt = ns;
+ }
+ break;
+ case NVME_ANA_NONOPTIMIZED:
+ if (depth < min_depth_nonopt) {
+ min_depth_nonopt = depth;
+ best_nonopt = ns;
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (min_depth_opt == 0)
+ return best_opt;
+ }
+
+ return best_opt ? best_opt : best_nonopt;
+}
+
static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
{
return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
ns->ana_state == NVME_ANA_OPTIMIZED;
}

-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
{
int node = numa_node_id();
struct nvme_ns *ns;
@@ -349,13 +404,25 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
if (unlikely(!ns))
return __nvme_find_path(head, node);

- if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
- return nvme_round_robin_path(head, node, ns);
if (unlikely(!nvme_path_is_optimized(ns)))
return __nvme_find_path(head, node);
+
return ns;
}

+
+inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+{
+ switch (READ_ONCE(head->subsys->iopolicy)) {
+ case NVME_IOPOLICY_QD:
+ return nvme_queue_depth_path(head);
+ case NVME_IOPOLICY_RR:
+ return nvme_round_robin_path(head);
+ default:
+ return nvme_numa_path(head);
+ }
+}
+
static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
@@ -803,6 +870,28 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
}

+static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
+ int iopolicy)
+{
+ struct nvme_ctrl *ctrl;
+ int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
+ if (old_iopolicy == iopolicy)
+ return;
+
+ WRITE_ONCE(subsys->iopolicy, iopolicy);
+
+ /* iopolicy changes clear the mpath by design */
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ nvme_mpath_clear_ctrl_paths(ctrl);
+ mutex_unlock(&nvme_subsystems_lock);
+
+ pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+ nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
+ subsys->subnqn);
+}
+
static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
@@ -812,7 +901,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,

for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
- WRITE_ONCE(subsys->iopolicy, i);
+ nvme_subsys_iopolicy_update(subsys, i);
return count;
}
}
@@ -923,6 +1012,9 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
!(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA))
return 0;

+ /* initialize this in the identify path to cover controller resets */
+ atomic_set(&ctrl->nr_active, 0);
+
if (!ctrl->max_namespaces ||
ctrl->max_namespaces > le32_to_cpu(id->nn)) {
dev_err(ctrl->device,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 73442d3f504b..d6c1fe3e2832 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;

+extern struct mutex nvme_subsystems_lock;
+
/*
* List of workarounds for devices that required behavior not specified in
* the standard.
@@ -195,6 +197,7 @@ enum {
NVME_REQ_CANCELLED = (1 << 0),
NVME_REQ_USERCMD = (1 << 1),
NVME_MPATH_IO_STATS = (1 << 2),
+ NVME_MPATH_CNT_ACTIVE = (1 << 3),
};

static inline struct nvme_request *nvme_req(struct request *req)
@@ -360,6 +363,7 @@ struct nvme_ctrl {
size_t ana_log_size;
struct timer_list anatt_timer;
struct work_struct ana_work;
+ atomic_t nr_active;
#endif

#ifdef CONFIG_NVME_HOST_AUTH
@@ -408,6 +412,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
+ NVME_IOPOLICY_QD,
};

struct nvme_subsystem {
--
2.39.3
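
The selection logic of nvme_queue_depth_path() in the patch above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the driver code: `struct path`, `enum ana_state` and `queue_depth_path` are hypothetical names, an array replaces the RCU-protected sibling list, and a plain `unsigned int` replaces the per-controller atomic counter.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's ANA states and namespace paths. */
enum ana_state { ANA_OPTIMIZED, ANA_NONOPTIMIZED, ANA_INACCESSIBLE };

struct path {
	enum ana_state state;
	bool disabled;
	unsigned int nr_active;	/* in-flight requests on this path's controller */
};

/* Return the least-loaded optimized path; if there is none, the
 * least-loaded non-optimized path; NULL if no path is usable. */
static struct path *queue_depth_path(struct path *paths, size_t n)
{
	struct path *best_opt = NULL, *best_nonopt = NULL;
	unsigned int min_opt = UINT_MAX, min_nonopt = UINT_MAX;

	for (size_t i = 0; i < n; i++) {
		struct path *p = &paths[i];

		if (p->disabled)
			continue;

		switch (p->state) {
		case ANA_OPTIMIZED:
			if (p->nr_active < min_opt) {
				min_opt = p->nr_active;
				best_opt = p;
			}
			break;
		case ANA_NONOPTIMIZED:
			if (p->nr_active < min_nonopt) {
				min_nonopt = p->nr_active;
				best_nonopt = p;
			}
			break;
		default:
			break;	/* other ANA states are never selected */
		}

		/* An idle optimized path cannot be beaten; stop searching. */
		if (min_opt == 0)
			return best_opt;
	}

	return best_opt ? best_opt : best_nonopt;
}
```

Note that an optimized path always wins over a non-optimized one regardless of depth; the non-optimized minimum is only a fallback.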


2024-06-12 01:44:41

by Chaitanya Kulkarni

Subject: Re: [PATCH v6 1/1] nvme-multipath: implement "queue-depth" iopolicy

On 6/11/24 17:20, John Meneghini wrote:
> From: Thomas Song <[email protected]>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between paths. In the presence of one or more
> high latency paths the round-robin selector continues to use the high
> latency path equally. This results in a bias towards the highest latency
> path and can cause a significant decrease in overall performance as IOs
> pile on the highest latency path. This problem is acute with NVMe-oF
> controllers.
>
> The queue-depth policy instead sends I/O requests down the path with the
> least amount of requests in its request queue. Paths with lower latency
> will clear requests more quickly and have less requests in their queues
> compared to higher latency paths. The goal of this path selector is to
> make more use of lower latency paths which will bring down overall IO
> latency and increase throughput and performance.
>
> Signed-off-by: Thomas Song <[email protected]>
> [emilne: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> and compilation warnings, updated MODULE_PARM description, and
> fixed potential issue with ->current_path[] being used]
> Co-developed-by: Ewan D. Milne <[email protected]>
> Signed-off-by: Ewan D. Milne <[email protected]>
> [jmeneghi: various changes and improvements, addressed review comments]
> Co-developed-by: John Meneghini <[email protected]>
> Signed-off-by: John Meneghini <[email protected]>
> Link: https://lore.kernel.org/linux-nvme/[email protected]/
> Tested-by: Marco Patalano <[email protected]>
> Reviewed-by: Randy Jennings <[email protected]>
> Tested-by: Jyoti Rani <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/multipath.c | 108 +++++++++++++++++++++++++++++++---
> drivers/nvme/host/nvme.h | 5 ++
> 3 files changed, 106 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7c9f91314d36..c10ff8815d82 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>
> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);
>
> static DEFINE_IDA(nvme_instance_ida);
> static dev_t nvme_ctrl_base_chr_devt;
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 03a6868f4dbc..fe10e0cebcf0 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> static const char *nvme_iopolicy_names[] = {
> [NVME_IOPOLICY_NUMA] = "numa",
> [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_NUMA;
> else if (!strncmp(val, "round-robin", 11))
> iopolicy = NVME_IOPOLICY_RR;
> + else if (!strncmp(val, "queue-depth", 11))
> + iopolicy = NVME_IOPOLICY_QD;
> else
> return -EINVAL;
>
> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
>
> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> {
> @@ -128,6 +131,11 @@ void nvme_mpath_start_request(struct request *rq)
> struct nvme_ns *ns = rq->q->queuedata;
> struct gendisk *disk = ns->head->disk;
>
> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> + atomic_inc(&ns->ctrl->nr_active);
> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> + }
> +
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> @@ -140,6 +148,12 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
> void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
> + int result;
> +
> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)) {
> + result = atomic_dec_if_positive(&ns->ctrl->nr_active);
> + WARN_ON_ONCE(result < 0);
> + }
>
> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;

Can we remove the result variable? It is only used once.
How about something like this, unless there is something wrong with it
(totally untested):

@@ -141,6 +149,9 @@ void nvme_mpath_end_request(struct request *rq)
 {
        struct nvme_ns *ns = rq->q->queuedata;

+       if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
+               WARN_ON_ONCE(atomic_dec_if_positive(&ns->ctrl->nr_active) < 0);
+
        if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
                return;
        bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
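
For readers unfamiliar with the helper used here: the kernel's atomic_dec_if_positive() returns the old value minus one, and stores that result only when it is non-negative. A negative return therefore means the counter was already zero and was left untouched, which is what the WARN_ON_ONCE() checks for. A minimal userspace model, assuming C11 atomics and a hypothetical name `dec_if_positive`, might look like:

```c
#include <stdatomic.h>

/* Userspace model of the kernel's atomic_dec_if_positive(): decrement *v
 * only if the result would stay non-negative, and return the old value
 * minus one whether or not the store happened. */
static int dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);
	int dec;

	do {
		dec = old - 1;
		if (dec < 0)
			break;	/* would go negative: leave *v untouched */
		/* On failure, old is reloaded with the current value and we retry. */
	} while (!atomic_compare_exchange_weak(v, &old, dec));

	return dec;
}
```

With this model, folding the call into WARN_ON_ONCE() as suggested keeps the same behavior as the version with the temporary variable.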


> @@ -291,10 +305,15 @@ static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
> return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
> }
>
> -static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> - int node, struct nvme_ns *old)
> +static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
> {
> - struct nvme_ns *ns, *found = NULL;
> + struct nvme_ns *ns, *old, *found = NULL;
> + int node = numa_node_id();
> +
> + old = srcu_dereference(head->current_path[node], &head->srcu);
> +

nit: no need for the blank line above?

> + if (unlikely(!old))
> + return __nvme_find_path(head, node);
>
> if (list_is_singular(&head->list)) {
> if (nvme_path_is_disabled(old))
> @@ -334,13 +353,49 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> return found;
> }
>
> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + if (min_depth_opt == 0)
> + return best_opt;
> + }
> +
> + return best_opt ? best_opt : best_nonopt;
> +}
> +
> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
> {
> return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
> ns->ana_state == NVME_ANA_OPTIMIZED;
> }
>
> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> +static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
> {
> int node = numa_node_id();
> struct nvme_ns *ns;
> @@ -349,13 +404,25 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> if (unlikely(!ns))
> return __nvme_find_path(head, node);
>
> - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
> - return nvme_round_robin_path(head, node, ns);
> if (unlikely(!nvme_path_is_optimized(ns)))
> return __nvme_find_path(head, node);
> +

nit: do we really need the blank line above in this code?

> return ns;
> }
>
> +
> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> +{
> + switch (READ_ONCE(head->subsys->iopolicy)) {
> + case NVME_IOPOLICY_QD:
> + return nvme_queue_depth_path(head);
> + case NVME_IOPOLICY_RR:
> + return nvme_round_robin_path(head);
> + default:
> + return nvme_numa_path(head);
> + }

Should we use another case for NVME_IOPOLICY_NUMA that will call
nvme_numa_path() and report a ratelimited error on the default label
before settling on nvme_numa_path()?

Something like this (totally untested):

inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
        switch (READ_ONCE(head->subsys->iopolicy)) {
        case NVME_IOPOLICY_QD:
                return nvme_queue_depth_path(head);
        case NVME_IOPOLICY_RR:
                return nvme_round_robin_path(head);
        case NVME_IOPOLICY_NUMA:
                return nvme_numa_path(head);
        default:
                dev_warn_ratelimited(disk_to_dev(head->disk),
                                     "invalid iopolicy %u",
                                     head->subsys->iopolicy);
                return nvme_numa_path(head);
        }
}

but if it has already been discussed to not have default case
then please ignore this comment ...

> +}
> +
> static bool nvme_available_path(struct nvme_ns_head *head)
> {
> struct nvme_ns *ns;
> @@ -803,6 +870,28 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
> nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
> }
>
> +static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
> + int iopolicy)
> +{
> + struct nvme_ctrl *ctrl;
> + int old_iopolicy = READ_ONCE(subsys->iopolicy);
> +
> + if (old_iopolicy == iopolicy)
> + return;
> +
> + WRITE_ONCE(subsys->iopolicy, iopolicy);
> +
> + /* iopolicy changes clear the mpath by design */
> + mutex_lock(&nvme_subsystems_lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> + nvme_mpath_clear_ctrl_paths(ctrl);
> + mutex_unlock(&nvme_subsystems_lock);
> +
> + pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
> + nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
> + subsys->subnqn);
> +}
> +
> static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> {
> @@ -812,7 +901,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
>
> for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
> if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
> - WRITE_ONCE(subsys->iopolicy, i);
> + nvme_subsys_iopolicy_update(subsys, i);
> return count;
> }
> }
> @@ -923,6 +1012,9 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> !(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA))
> return 0;
>
> + /* initialize this in the identify path to cover controller resets */

nit: If I'm not wrong, this function gets called from nvme_init_identify(),
so it's pretty clear. Doesn't that make the above comment kind of redundant?
However, if others want that comment here, please ignore this message.

> + atomic_set(&ctrl->nr_active, 0);
> +
> if (!ctrl->max_namespaces ||
> ctrl->max_namespaces > le32_to_cpu(id->nn)) {
> dev_err(ctrl->device,
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 73442d3f504b..d6c1fe3e2832 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
> extern struct workqueue_struct *nvme_reset_wq;
> extern struct workqueue_struct *nvme_delete_wq;
>
> +extern struct mutex nvme_subsystems_lock;
> +
> /*
> * List of workarounds for devices that required behavior not specified in
> * the standard.
> @@ -195,6 +197,7 @@ enum {
> NVME_REQ_CANCELLED = (1 << 0),
> NVME_REQ_USERCMD = (1 << 1),
> NVME_MPATH_IO_STATS = (1 << 2),
> + NVME_MPATH_CNT_ACTIVE = (1 << 3),

nit:- please align above to existing code ...

> };
>
> static inline struct nvme_request *nvme_req(struct request *req)
> @@ -360,6 +363,7 @@ struct nvme_ctrl {
> size_t ana_log_size;
> struct timer_list anatt_timer;
> struct work_struct ana_work;
> + atomic_t nr_active;
> #endif
>
> #ifdef CONFIG_NVME_HOST_AUTH
> @@ -408,6 +412,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
> enum nvme_iopolicy {
> NVME_IOPOLICY_NUMA,
> NVME_IOPOLICY_RR,
> + NVME_IOPOLICY_QD,
> };
>
> struct nvme_subsystem {

-ck


2024-06-12 06:11:35

by Hannes Reinecke

Subject: Re: [PATCH v6 1/1] nvme-multipath: implement "queue-depth" iopolicy

On 6/12/24 02:20, John Meneghini wrote:
> From: Thomas Song <[email protected]>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between paths. In the presence of one or more
> high latency paths the round-robin selector continues to use the high
> latency path equally. This results in a bias towards the highest latency
> path and can cause a significant decrease in overall performance as IOs
> pile on the highest latency path. This problem is acute with NVMe-oF
> controllers.
>
> The queue-depth policy instead sends I/O requests down the path with the
> least amount of requests in its request queue. Paths with lower latency
> will clear requests more quickly and have less requests in their queues
> compared to higher latency paths. The goal of this path selector is to
> make more use of lower latency paths which will bring down overall IO
> latency and increase throughput and performance.
>
> Signed-off-by: Thomas Song <[email protected]>
> [emilne: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> and compilation warnings, updated MODULE_PARM description, and
> fixed potential issue with ->current_path[] being used]
> Co-developed-by: Ewan D. Milne <[email protected]>
> Signed-off-by: Ewan D. Milne <[email protected]>
> [jmeneghi: various changes and improvements, addressed review comments]
> Co-developed-by: John Meneghini <[email protected]>
> Signed-off-by: John Meneghini <[email protected]>
> Link: https://lore.kernel.org/linux-nvme/[email protected]/
> Tested-by: Marco Patalano <[email protected]>
> Reviewed-by: Randy Jennings <[email protected]>
> Tested-by: Jyoti Rani <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/multipath.c | 108 +++++++++++++++++++++++++++++++---
> drivers/nvme/host/nvme.h | 5 ++
> 3 files changed, 106 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7c9f91314d36..c10ff8815d82 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>
> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);
>
> static DEFINE_IDA(nvme_instance_ida);
> static dev_t nvme_ctrl_base_chr_devt;
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 03a6868f4dbc..fe10e0cebcf0 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> static const char *nvme_iopolicy_names[] = {
> [NVME_IOPOLICY_NUMA] = "numa",
> [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_NUMA;
> else if (!strncmp(val, "round-robin", 11))
> iopolicy = NVME_IOPOLICY_RR;
> + else if (!strncmp(val, "queue-depth", 11))
> + iopolicy = NVME_IOPOLICY_QD;
> else
> return -EINVAL;
>
> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
>
> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> {
> @@ -128,6 +131,11 @@ void nvme_mpath_start_request(struct request *rq)
> struct nvme_ns *ns = rq->q->queuedata;
> struct gendisk *disk = ns->head->disk;
>
> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> + atomic_inc(&ns->ctrl->nr_active);
> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> + }
> +
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> @@ -140,6 +148,12 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
> void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
> + int result;
> +
> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)) {
> + result = atomic_dec_if_positive(&ns->ctrl->nr_active);
> + WARN_ON_ONCE(result < 0);
> + }
>
> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;
> @@ -291,10 +305,15 @@ static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
> return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
> }
>
> -static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> - int node, struct nvme_ns *old)
> +static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
> {
> - struct nvme_ns *ns, *found = NULL;
> + struct nvme_ns *ns, *old, *found = NULL;
> + int node = numa_node_id();
> +
> + old = srcu_dereference(head->current_path[node], &head->srcu);
> +
> + if (unlikely(!old))
> + return __nvme_find_path(head, node);
>
> if (list_is_singular(&head->list)) {
> if (nvme_path_is_disabled(old))
> @@ -334,13 +353,49 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> return found;
> }
>
> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + if (min_depth_opt == 0)
> + return best_opt;
> + }
> +
> + return best_opt ? best_opt : best_nonopt;
> +}
> +
> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
> {
> return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
> ns->ana_state == NVME_ANA_OPTIMIZED;
> }
>
> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> +static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
> {
> int node = numa_node_id();
> struct nvme_ns *ns;
> @@ -349,13 +404,25 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> if (unlikely(!ns))
> return __nvme_find_path(head, node);
>
> - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
> - return nvme_round_robin_path(head, node, ns);
> if (unlikely(!nvme_path_is_optimized(ns)))
> return __nvme_find_path(head, node);
> +

Pointless newline.

But other than that:

Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-06-12 13:07:54

by John Meneghini

Subject: Re: [PATCH v6 1/1] nvme-multipath: implement "queue-depth" iopolicy


On 6/11/24 21:44, Chaitanya Kulkarni wrote:
>> /*
>> * List of workarounds for devices that required behavior not specified in
>> * the standard.
>> @@ -195,6 +197,7 @@ enum {
>> NVME_REQ_CANCELLED = (1 << 0),
>> NVME_REQ_USERCMD = (1 << 1),
>> NVME_MPATH_IO_STATS = (1 << 2),
>> + NVME_MPATH_CNT_ACTIVE = (1 << 3),
> nit:- please align above to existing code ...

OK, there must be something wrong with my tab stop. Are we using 8-space tabs in Linux?

Here's what I have in my .vimrc

set smartindent " always set smartindenting on
set autoindent " always set autoindenting on
set backspace=2 " Influences the working of <BS>, <Del>, CTRL-W and CTRL-U in Insert mode.
set noexpandtab " insert tabs instead of spaces
set textwidth=0 " Don't wrap words by default
set shiftwidth=4 smarttab " number of spaces to use for each step of indent
set tabstop=4 softtabstop=0 " setting tab stops to be different from the indentation width

What are the recommended settings for these?

set shiftwidth=8 ?
set tabstop=8 ?

/John