2024-05-14 17:53:44

by John Meneghini

Subject: [PATCH v4 0/6] block,nvme: queue-depth and latency I/O schedulers

Changes since V3:

I've included Ewan's queue-depth patches in this new series and rebased
everything onto nvme-6.10. I also addressed a few review comments and
modified the commit headers; the code is otherwise unchanged.

Changes since V2:

I've done quite a bit of work cleaning up these patches. There were a
number of checkpatch.pl problems as well as some compile-time errors
when CONFIG_BLK_NODE_LATENCY was turned off. After the cleanup I
rebased these patches onto Ewan's "nvme: queue-depth multipath iopolicy"
patches, which allowed me to test both iopolicy changes together.

All of my test results, together with the scripts I used to generate these
graphs, are available at:

https://github.com/johnmeneghini/iopolicy

Please use the scripts in this repository to do your own testing.

Changes since V1:

Hi all,

There have been several attempts to implement a latency-based I/O
scheduler for native nvme multipath, all of which had their issues.

So it's time to start afresh, this time using the QoS framework
already present in the block layer.
It consists of two parts:
- a new 'blk-nlatency' QoS module, which is just a simple per-node
latency tracker
- a 'latency' nvme I/O policy

Using the 'tiobench' fio script with a 512-byte blocksize I'm getting
the following latencies (in usecs) as a baseline:
- seq write: avg 186 stddev 331
- rand write: avg 4598 stddev 7903
- seq read: avg 149 stddev 65
- rand read: avg 150 stddev 68

Enabling the 'latency' iopolicy:
- seq write: avg 178 stddev 113
- rand write: avg 3427 stddev 6703
- seq read: avg 140 stddev 59
- rand read: avg 141 stddev 58

Setting the 'decay' parameter to 10:
- seq write: avg 182 stddev 65
- rand write: avg 2619 stddev 5894
- seq read: avg 142 stddev 57
- rand read: avg 140 stddev 57

That's on a 32G FC testbed running against a brd target,
fio running with 48 threads. So promises are met: latency
goes down, and we're even able to control the standard
deviation via the 'decay' parameter.
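
The 'decay' knob is what makes the stddev controllable: it is one of
the rq-qos debugfs attributes added by the blk-nlatency patch, and the
per-window update it controls boils down to the following. This is a
condensed restatement of the nlat_timer_fn() logic from that patch, not
a verbatim extract, and the helper name is made up:

#include <linux/math64.h>

/*
 * Once per measurement window: a node bucket that saw fewer than
 * 'lowat' samples is considered stale and decayed by 'decay' percent
 * instead of being overwritten, so a lightly used path drifts back
 * towards zero and gets re-tested.
 */
static u64 nlat_update_bucket(u64 old_latency, u64 new_mean,
                              unsigned int nr_samples,
                              unsigned int lowat, unsigned int decay)
{
        if (nr_samples < lowat)
                return div64_u64(old_latency * decay, 100);
        return new_mean;
}

With decay=10 a stale bucket therefore drops to 10% of its previous
value each window, which is why the outliers (and with them the
standard deviation) shrink in the run above. The iopolicy itself is
selected per subsystem via the 'iopolicy' sysfs attribute, or as a
default through the nvme_core.iopolicy module parameter.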

As usual, comments and reviews are welcome.

Changes to the original version:
- split the rqos debugfs entries
- modified the commit message to indicate latency
- renamed to blk-nlatency

Ewan D. Milne (3):
nvme: multipath: Implemented new iopolicy "queue-depth"
nvme: multipath: only update ctrl->nr_active when using queue-depth
iopolicy
nvme: multipath: Invalidate current_path when changing iopolicy

Hannes Reinecke (2):
block: track per-node I/O latency
nvme: add 'latency' iopolicy

John Meneghini (1):
nvme: multipath: pr_notice when iopolicy changes

MAINTAINERS | 1 +
block/Kconfig | 9 +
block/Makefile | 1 +
block/blk-mq-debugfs.c | 2 +
block/blk-nlatency.c | 389 ++++++++++++++++++++++++++++++++++
block/blk-rq-qos.h | 6 +
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 143 ++++++++++++-
drivers/nvme/host/nvme.h | 9 +
include/linux/blk-mq.h | 11 +
10 files changed, 563 insertions(+), 10 deletions(-)
create mode 100644 block/blk-nlatency.c

--
2.39.3



2024-05-14 17:53:59

by John Meneghini

Subject: [PATCH v4 2/6] nvme: multipath: only update ctrl->nr_active when using queue-depth iopolicy

From: "Ewan D. Milne" <[email protected]>

The atomic updates of ctrl->nr_active are unnecessary when using the
numa or round-robin iopolicy, so avoid that cost on a per-request basis.
Clear nr_active when changing the iopolicy and do not decrement below
zero. (This handles changing the iopolicy while requests are in flight.)
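
To make the in-flight case concrete, here is a minimal illustration of
why the completion side must not use a plain atomic_dec(). This is not
a hunk from the patch and the helper name is made up; it assumes the
nvme_ctrl definition from drivers/nvme/host/nvme.h:

/*
 * A switch to "queue-depth" resets nr_active to 0 while requests
 * started under the old policy are still in flight.  Those requests
 * never incremented the counter, so when they complete a plain
 * atomic_dec() would drive it negative and skew the depth estimate.
 */
static void example_end_request(struct nvme_ctrl *ctrl)
{
        /* decrements only while the counter is still positive */
        atomic_dec_if_positive(&ctrl->nr_active);
}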

Tested-by: John Meneghini <[email protected]>
Signed-off-by: Ewan D. Milne <[email protected]>
---
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 21 ++++++++++++++++++---
drivers/nvme/host/nvme.h | 6 ++++++
3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a066429b790d..1dd7c52293ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
EXPORT_SYMBOL_GPL(nvme_delete_wq);

static LIST_HEAD(nvme_subsystems);
-static DEFINE_MUTEX(nvme_subsystems_lock);
+DEFINE_MUTEX(nvme_subsystems_lock);

static DEFINE_IDA(nvme_instance_ida);
static dev_t nvme_ctrl_base_chr_devt;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9e36002d0831..1e9338543ded 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -133,7 +133,8 @@ void nvme_mpath_start_request(struct request *rq)
if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
return;

- atomic_inc(&ns->ctrl->nr_active);
+ if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD)
+ atomic_inc(&ns->ctrl->nr_active);
nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
jiffies);
@@ -147,7 +148,8 @@ void nvme_mpath_end_request(struct request *rq)
if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;

- atomic_dec(&ns->ctrl->nr_active);
+ if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD)
+ atomic_dec_if_positive(&ns->ctrl->nr_active);
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
blk_rq_bytes(rq) >> SECTOR_SHIFT,
nvme_req(rq)->start_time);
@@ -850,6 +852,19 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
}

+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+ struct nvme_ctrl *ctrl;
+
+ WRITE_ONCE(subsys->iopolicy, iopolicy);
+
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ atomic_set(&ctrl->nr_active, 0);
+ }
+ mutex_unlock(&nvme_subsystems_lock);
+}
+
static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
@@ -859,7 +874,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,

for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
- WRITE_ONCE(subsys->iopolicy, i);
+ nvme_subsys_iopolicy_update(subsys, i);
return count;
}
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7d0a56d35d4..4e876524726a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;

+extern struct mutex nvme_subsystems_lock;
+
/*
* List of workarounds for devices that required behavior not specified in
* the standard.
@@ -937,6 +939,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
void nvme_mpath_start_request(struct request *rq);
void nvme_mpath_end_request(struct request *rq);
+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);

static inline void nvme_trace_bio_complete(struct request *req)
{
@@ -1036,6 +1039,9 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
{
return false;
}
+static inline void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+}
#endif /* CONFIG_NVME_MULTIPATH */

int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
--
2.39.3


2024-05-14 17:54:17

by John Meneghini

Subject: [PATCH v4 3/6] nvme: multipath: Invalidate current_path when changing iopolicy

From: "Ewan D. Milne" <[email protected]>

When switching back to numa from round-robin, current_path may refer to
a different path than the one numa would have selected, and it is desirable
to have consistent behavior.

Tested-by: John Meneghini <[email protected]>
Signed-off-by: Ewan D. Milne <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/multipath.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1e9338543ded..8702a40a1971 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -861,6 +861,7 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
mutex_lock(&nvme_subsystems_lock);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
atomic_set(&ctrl->nr_active, 0);
+ nvme_mpath_clear_ctrl_paths(ctrl);
}
mutex_unlock(&nvme_subsystems_lock);
}
--
2.39.3


2024-05-14 17:54:33

by John Meneghini

Subject: [PATCH v4 6/6] nvme: multipath: pr_notice when iopolicy changes

Send a pr_notice whenever the iopolicy on a subsystem is changed. This
is important for support reasons: it is fully expected that users will
be changing the iopolicy with active I/O in progress.
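
For illustration, a change from numa to queue-depth would then show up
in the log roughly as follows (the subsystem NQN is a made-up example):

  nvme_subsys_iopolicy_update: changed from numa to queue-depth for subsysnqn nqn.2014-08.org.nvmexpress:uuid:example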

Signed-off-by: John Meneghini <[email protected]>
---
drivers/nvme/host/multipath.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e9330bb1990b..0286e44a081f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -67,6 +67,10 @@ static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
}
}
mutex_unlock(&subsys->lock);
+
+ pr_notice("%s: %s enable %d status %d for subsysnqn %s\n", __func__,
+ nvme_iopolicy_names[iopolicy], enable, ret, subsys->subnqn);
+
return ret;
}

@@ -890,6 +894,8 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
{
struct nvme_ctrl *ctrl;

+ int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
WRITE_ONCE(subsys->iopolicy, iopolicy);

mutex_lock(&nvme_subsystems_lock);
@@ -898,6 +904,10 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
nvme_mpath_clear_ctrl_paths(ctrl);
}
mutex_unlock(&nvme_subsystems_lock);
+
+ pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+ nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
+ subsys->subnqn);
}

static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
--
2.39.3


2024-05-14 17:54:50

by John Meneghini

Subject: [PATCH v4 5/6] nvme: add 'latency' iopolicy

From: Hannes Reinecke <[email protected]>

Add a latency-based I/O policy for multipathing. It uses the blk-nlatency
latency tracker to provide latencies for each node, and schedules
I/O on the path with the least latency for the submitting node.

Signed-off-by: Hannes Reinecke <[email protected]>
[jmeneghi: fix build with CONFIG_BLK_NODE_LATENCY=n and add latency iopolicy to modinfo]
Signed-off-by: John Meneghini <[email protected]>
---
drivers/nvme/host/multipath.c | 62 ++++++++++++++++++++++++++++++-----
drivers/nvme/host/nvme.h | 1 +
2 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8702a40a1971..e9330bb1990b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
[NVME_IOPOLICY_QD] = "queue-depth",
+ [NVME_IOPOLICY_LAT] = "latency",
};

static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -32,6 +33,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_RR;
else if (!strncmp(val, "queue-depth", 11))
iopolicy = NVME_IOPOLICY_QD;
+#ifdef CONFIG_BLK_NODE_LATENCY
+ else if (!strncmp(val, "latency", 7))
+ iopolicy = NVME_IOPOLICY_LAT;
+#endif
else
return -EINVAL;

@@ -43,10 +48,36 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
}

+static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
+{
+ struct nvme_ns_head *h;
+ struct nvme_ns *ns;
+ bool enable = iopolicy == NVME_IOPOLICY_LAT;
+ int ret = 0;
+
+ mutex_lock(&subsys->lock);
+ list_for_each_entry(h, &subsys->nsheads, entry) {
+ list_for_each_entry_rcu(ns, &h->list, siblings) {
+ if (enable) {
+ ret = blk_nlat_enable(ns->disk);
+ if (ret)
+ break;
+ } else
+ blk_nlat_disable(ns->disk);
+ }
+ }
+ mutex_unlock(&subsys->lock);
+ return ret;
+}
+
module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
&iopolicy, 0644);
MODULE_PARM_DESC(iopolicy,
+#if defined(CONFIG_BLK_NODE_LATENCY)
+ "Default multipath I/O policy; 'numa' (default) , 'round-robin', 'queue-depth' or 'latency'");
+#else
"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
+#endif

void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
{
@@ -250,13 +281,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
{
int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
struct nvme_ns *found = NULL, *fallback = NULL, *ns;
+ int iopolicy = READ_ONCE(head->subsys->iopolicy);

list_for_each_entry_rcu(ns, &head->list, siblings) {
if (nvme_path_is_disabled(ns))
continue;

- if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
+ if (iopolicy == NVME_IOPOLICY_NUMA)
distance = node_distance(node, ns->ctrl->numa_node);
+ else if (iopolicy == NVME_IOPOLICY_LAT)
+ distance = blk_nlat_latency(ns->disk, node);
else
distance = LOCAL_DISTANCE;

@@ -380,8 +414,8 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)

inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
- int iopolicy = READ_ONCE(head->subsys->iopolicy);
int node;
+ int iopolicy = READ_ONCE(head->subsys->iopolicy);
struct nvme_ns *ns;

/*
@@ -400,8 +434,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)

if (iopolicy == NVME_IOPOLICY_RR)
return nvme_round_robin_path(head, node, ns);
-
- if (unlikely(!nvme_path_is_optimized(ns)))
+ if (iopolicy == NVME_IOPOLICY_LAT ||
+ unlikely(!nvme_path_is_optimized(ns)))
return __nvme_find_path(head, node);
return ns;
}
@@ -871,15 +905,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
{
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
- int i;
+ int i, ret;

for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
- nvme_subsys_iopolicy_update(subsys, i);
- return count;
+ ret = nvme_activate_iopolicy(subsys, i);
+ if (!ret) {
+ nvme_subsys_iopolicy_update(subsys, i);
+ return count;
+ }
+ return ret;
}
}
-
return -EINVAL;
}
SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR,
@@ -915,6 +952,15 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,

void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
{
+ if (!blk_nlat_init(ns->disk) &&
+ READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_LAT) {
+ int ret = blk_nlat_enable(ns->disk);
+
+ if (unlikely(ret))
+ pr_warn("%s: Failed to enable latency tracking, error %d\n",
+ ns->disk->disk_name, ret);
+ }
+
if (nvme_ctrl_use_ana(ns->ctrl)) {
struct nvme_ana_group_desc desc = {
.grpid = anagrpid,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4e876524726a..56b78f21406a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -406,6 +406,7 @@ enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
NVME_IOPOLICY_QD,
+ NVME_IOPOLICY_LAT,
};

struct nvme_subsystem {
--
2.39.3


2024-05-14 17:55:00

by John Meneghini

Subject: [PATCH v4 4/6] block: track per-node I/O latency

From: Hannes Reinecke <[email protected]>

Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
This can be used by I/O schedulers to determine the 'best' queue
to send I/O to.
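
For context, a consumer of this interface (the nvme 'latency' iopolicy
in this series is the only one so far) would use it roughly as follows.
This is a minimal sketch written for this description, not code from
the patch, and the example_* helpers are made up:

#include <linux/blk-mq.h>

/* Attach the tracker to a disk and start measuring (illustration only). */
static int example_enable_tracking(struct gendisk *disk)
{
        int ret;

        /*
         * blk_nlat_init() adds the rq-qos policy; the stub used when
         * BLK_NODE_LATENCY is not configured returns -EOPNOTSUPP.
         */
        ret = blk_nlat_init(disk);
        if (ret)
                return ret;

        /* start sampling over the default measurement window */
        return blk_nlat_enable(disk);
}

/* Pick the disk with the lowest measured latency as seen from 'node'. */
static struct gendisk *example_best_disk(struct gendisk **disks, int nr,
                                         int node)
{
        struct gendisk *best = NULL;
        u64 best_lat = U64_MAX;
        int i;

        for (i = 0; i < nr; i++) {
                /* mean latency in microseconds, 0 if nothing measured yet */
                u64 lat = blk_nlat_latency(disks[i], node);

                if (lat && lat < best_lat) {
                        best_lat = lat;
                        best = disks[i];
                }
        }
        return best;
}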

Signed-off-by: Hannes Reinecke <[email protected]>
[jmeneghi: cleaned up checkpatch warnings and updated MAINTAINERS]
Signed-off-by: John Meneghini <[email protected]>
---
MAINTAINERS | 1 +
block/Kconfig | 9 +
block/Makefile | 1 +
block/blk-mq-debugfs.c | 2 +
block/blk-nlatency.c | 389 +++++++++++++++++++++++++++++++++++++++++
block/blk-rq-qos.h | 6 +
include/linux/blk-mq.h | 11 ++
7 files changed, 419 insertions(+)
create mode 100644 block/blk-nlatency.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7c121493f43d..a4634365c82f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5405,6 +5405,7 @@ F: block/bfq-cgroup.c
F: block/blk-cgroup.c
F: block/blk-iocost.c
F: block/blk-iolatency.c
+F: block/blk-nlatency.c
F: block/blk-throttle.c
F: include/linux/blk-cgroup.h

diff --git a/block/Kconfig b/block/Kconfig
index d47398ae9824..d8edb4506769 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -185,6 +185,15 @@ config BLK_CGROUP_IOPRIO
scheduler and block devices process requests. Only some I/O schedulers
and some block devices support I/O priorities.

+config BLK_NODE_LATENCY
+ bool "Track per-node I/O latency"
+ help
+ Enable per-node I/O latency tracking for multipathing. This uses the
+ blk-nodelat latency tracker to provide latencies for each node, and schedules
+ I/O on the path with the least latency for the submitting node. This can be
+ used by I/O schedulers to determine the node with the least latency. Currently
+ only supports nvme over fabrics devices.
+
config BLK_DEBUG_FS
bool "Block layer debugging information in debugfs"
default y
diff --git a/block/Makefile b/block/Makefile
index 168150b9c510..043d979de8fe 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOPRIO) += blk-ioprio.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
+obj-$(CONFIG_BLK_NODE_LATENCY) += blk-nlatency.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 770c0c2b72fa..bc2541428e81 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -761,6 +761,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
return "latency";
case RQ_QOS_COST:
return "cost";
+ case RQ_QOS_NLAT:
+ return "node-latency";
}
return "unknown";
}
diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
new file mode 100644
index 000000000000..219c3f636d76
--- /dev/null
+++ b/block/blk-nlatency.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-node request latency tracking.
+ *
+ * Copyright (C) 2023 Hannes Reinecke
+ *
+ * A simple per-node latency tracker for use by I/O schedulers.
+ * Latencies are measured over 'win_usec' microseconds and stored per node.
+ * If the number of measurements falls below 'lowat' the measurement is
+ * assumed to be unreliable and will become 'stale'.
+ * These 'stale' latencies can be 'decayed', where during each measurement
+ * interval the 'stale' latency value is decreased by 'decay' percent.
+ * Once the 'stale' latency reaches zero it will be updated by the
+ * measured latency.
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+
+#include "blk-stat.h"
+#include "blk-rq-qos.h"
+#include "blk.h"
+
+#define NLAT_DEFAULT_LOWAT 2
+#define NLAT_DEFAULT_DECAY 50
+
+struct rq_nlat {
+ struct rq_qos rqos;
+
+ u64 win_usec; /* latency measurement window in microseconds */
+ unsigned int lowat; /* Low Watermark latency measurement */
+ unsigned int decay; /* Percentage for 'decaying' latencies */
+ bool enabled;
+
+ struct blk_stat_callback *cb;
+
+ unsigned int num;
+ u64 *latency;
+ unsigned int *samples;
+};
+
+static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
+{
+ return container_of(rqos, struct rq_nlat, rqos);
+}
+
+static u64 nlat_default_latency_usec(struct request_queue *q)
+{
+ /*
+ * We default to 2msec for non-rotational storage, and 75msec
+ * for rotational storage.
+ */
+ if (blk_queue_nonrot(q))
+ return 2000ULL;
+ else
+ return 75000ULL;
+}
+
+static void nlat_timer_fn(struct blk_stat_callback *cb)
+{
+ struct rq_nlat *nlat = cb->data;
+ int n;
+
+ for (n = 0; n < cb->buckets; n++) {
+ if (cb->stat[n].nr_samples < nlat->lowat) {
+ /*
+ * 'decay' the latency by the specified
+ * percentage to ensure the queues are
+ * being tested to balance out temporary
+ * latency spikes.
+ */
+ nlat->latency[n] =
+ div64_u64(nlat->latency[n] * nlat->decay, 100);
+ } else
+ nlat->latency[n] = cb->stat[n].mean;
+ nlat->samples[n] = cb->stat[n].nr_samples;
+ }
+ if (nlat->enabled)
+ blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+}
+
+static int nlat_bucket_node(const struct request *rq)
+{
+ if (!rq->mq_ctx)
+ return -1;
+ return cpu_to_node(blk_mq_rq_cpu((struct request *)rq));
+}
+
+static void nlat_exit(struct rq_qos *rqos)
+{
+ struct rq_nlat *nlat = RQNLAT(rqos);
+
+ blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb);
+ blk_stat_free_callback(nlat->cb);
+ kfree(nlat->samples);
+ kfree(nlat->latency);
+ kfree(nlat);
+}
+
+#ifdef CONFIG_BLK_DEBUG_FS
+static int nlat_win_usec_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+
+ seq_printf(m, "%llu\n", nlat->win_usec);
+ return 0;
+}
+
+static ssize_t nlat_win_usec_write(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+ char val[16] = { };
+ u64 usec;
+ int err;
+
+ if (blk_queue_dying(nlat->rqos.disk->queue))
+ return -ENOENT;
+
+ if (count >= sizeof(val))
+ return -EINVAL;
+
+ if (copy_from_user(val, buf, count))
+ return -EFAULT;
+
+ err = kstrtoull(val, 10, &usec);
+ if (err)
+ return err;
+ blk_stat_deactivate(nlat->cb);
+ nlat->win_usec = usec;
+ blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+ return count;
+}
+
+static int nlat_lowat_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+
+ seq_printf(m, "%u\n", nlat->lowat);
+ return 0;
+}
+
+static ssize_t nlat_lowat_write(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+ char val[16] = { };
+ unsigned int lowat;
+ int err;
+
+ if (blk_queue_dying(nlat->rqos.disk->queue))
+ return -ENOENT;
+
+ if (count >= sizeof(val))
+ return -EINVAL;
+
+ if (copy_from_user(val, buf, count))
+ return -EFAULT;
+
+ err = kstrtouint(val, 10, &lowat);
+ if (err)
+ return err;
+ blk_stat_deactivate(nlat->cb);
+ nlat->lowat = lowat;
+ blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+ return count;
+}
+
+static int nlat_decay_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+
+ seq_printf(m, "%u\n", nlat->decay);
+ return 0;
+}
+
+static ssize_t nlat_decay_write(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+ char val[16] = { };
+ unsigned int decay;
+ int err;
+
+ if (blk_queue_dying(nlat->rqos.disk->queue))
+ return -ENOENT;
+
+ if (count >= sizeof(val))
+ return -EINVAL;
+
+ if (copy_from_user(val, buf, count))
+ return -EFAULT;
+
+ err = kstrtouint(val, 10, &decay);
+ if (err)
+ return err;
+ if (decay > 100)
+ return -EINVAL;
+ blk_stat_deactivate(nlat->cb);
+ nlat->decay = decay;
+ blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+ return count;
+}
+
+static int nlat_enabled_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+
+ seq_printf(m, "%d\n", nlat->enabled);
+ return 0;
+}
+
+static int nlat_id_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+
+ seq_printf(m, "%u\n", rqos->id);
+ return 0;
+}
+
+static int nlat_latency_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+ int n;
+
+ if (!nlat->enabled)
+ return 0;
+
+ for (n = 0; n < nlat->num; n++) {
+ if (n > 0)
+ seq_puts(m, " ");
+ seq_printf(m, "%llu", nlat->latency[n]);
+ }
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static int nlat_samples_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_nlat *nlat = RQNLAT(rqos);
+ int n;
+
+ if (!nlat->enabled)
+ return 0;
+
+ for (n = 0; n < nlat->num; n++) {
+ if (n > 0)
+ seq_puts(m, " ");
+ seq_printf(m, "%u", nlat->samples[n]);
+ }
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] = {
+ {"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write},
+ {"lowat", 0600, nlat_lowat_show, nlat_lowat_write},
+ {"decay", 0600, nlat_decay_show, nlat_decay_write},
+ {"enabled", 0400, nlat_enabled_show},
+ {"id", 0400, nlat_id_show},
+ {"latency", 0400, nlat_latency_show},
+ {"samples", 0400, nlat_samples_show},
+ {},
+};
+#endif
+
+static const struct rq_qos_ops nlat_rqos_ops = {
+ .exit = nlat_exit,
+#ifdef CONFIG_BLK_DEBUG_FS
+ .debugfs_attrs = nlat_debugfs_attrs,
+#endif
+};
+
+u64 blk_nlat_latency(struct gendisk *disk, int node)
+{
+ struct rq_qos *rqos;
+ struct rq_nlat *nlat;
+
+ rqos = nlat_rq_qos(disk->queue);
+ if (!rqos)
+ return 0;
+ nlat = RQNLAT(rqos);
+ if (node >= nlat->num)
+ return 0;
+
+ return div64_u64(nlat->latency[node], 1000);
+}
+EXPORT_SYMBOL_GPL(blk_nlat_latency);
+
+int blk_nlat_enable(struct gendisk *disk)
+{
+ struct rq_qos *rqos;
+ struct rq_nlat *nlat;
+
+ /* Latency tracking not enabled? */
+ rqos = nlat_rq_qos(disk->queue);
+ if (!rqos)
+ return -EINVAL;
+ nlat = RQNLAT(rqos);
+ if (nlat->enabled)
+ return 0;
+
+ /* Queue not registered? Maybe shutting down... */
+ if (!blk_queue_registered(disk->queue))
+ return -EAGAIN;
+
+ nlat->enabled = true;
+ memset(nlat->latency, 0, sizeof(u64) * nlat->num);
+ memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num);
+ blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_enable);
+
+void blk_nlat_disable(struct gendisk *disk)
+{
+ struct rq_qos *rqos = nlat_rq_qos(disk->queue);
+ struct rq_nlat *nlat;
+
+ if (!rqos)
+ return;
+ nlat = RQNLAT(rqos);
+ if (nlat->enabled) {
+ blk_stat_deactivate(nlat->cb);
+ nlat->enabled = false;
+ }
+}
+EXPORT_SYMBOL_GPL(blk_nlat_disable);
+
+int blk_nlat_init(struct gendisk *disk)
+{
+ struct rq_nlat *nlat;
+ int ret = -ENOMEM;
+
+ nlat = kzalloc(sizeof(*nlat), GFP_KERNEL);
+ if (!nlat)
+ return -ENOMEM;
+
+ nlat->num = num_possible_nodes();
+ nlat->lowat = NLAT_DEFAULT_LOWAT;
+ nlat->decay = NLAT_DEFAULT_DECAY;
+ nlat->win_usec = nlat_default_latency_usec(disk->queue);
+
+ nlat->latency = kcalloc(nlat->num, sizeof(u64), GFP_KERNEL);
+ if (!nlat->latency)
+ goto err_free;
+ nlat->samples = kcalloc(nlat->num, sizeof(unsigned int), GFP_KERNEL);
+ if (!nlat->samples)
+ goto err_free;
+ nlat->cb = blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node,
+ nlat->num, nlat);
+ if (!nlat->cb)
+ goto err_free;
+
+ /*
+ * Add the rq-qos policy and the stats callback.
+ */
+ mutex_lock(&disk->queue->rq_qos_mutex);
+ ret = rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops);
+ mutex_unlock(&disk->queue->rq_qos_mutex);
+ if (ret)
+ goto err_free_cb;
+
+ blk_stat_add_callback(disk->queue, nlat->cb);
+
+ return 0;
+
+err_free_cb:
+ blk_stat_free_callback(nlat->cb);
+err_free:
+ kfree(nlat->samples);
+ kfree(nlat->latency);
+ kfree(nlat);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_init);
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 37245c97ee61..2fc11ced0c00 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -17,6 +17,7 @@ enum rq_qos_id {
RQ_QOS_WBT,
RQ_QOS_LATENCY,
RQ_QOS_COST,
+ RQ_QOS_NLAT,
};

struct rq_wait {
@@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request_queue *q)
return rq_qos_id(q, RQ_QOS_LATENCY);
}

+static inline struct rq_qos *nlat_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_NLAT);
+}
+
static inline void rq_wait_init(struct rq_wait *rq_wait)
{
atomic_set(&rq_wait->inflight, 0);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 89ba6b16fe8b..925e8c19bedb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1150,4 +1150,15 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
void blk_dump_rq_flags(struct request *, char *);

+#ifdef CONFIG_BLK_NODE_LATENCY
+int blk_nlat_enable(struct gendisk *disk);
+void blk_nlat_disable(struct gendisk *disk);
+u64 blk_nlat_latency(struct gendisk *disk, int node);
+int blk_nlat_init(struct gendisk *disk);
+#else
+static inline int blk_nlat_enable(struct gendisk *disk) { return 0; }
+static inline void blk_nlat_disable(struct gendisk *disk) {}
+static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { return 0; }
+static inline int blk_nlat_init(struct gendisk *disk) { return -EOPNOTSUPP; }
+#endif
#endif /* BLK_MQ_H */
--
2.39.3


2024-05-14 17:55:16

by John Meneghini

Subject: [PATCH v4 1/6] nvme: multipath: Implemented new iopolicy "queue-depth"

From: "Ewan D. Milne" <[email protected]>

The existing iopolicies are inefficient in some cases, such as when
one path has much higher latency than the others. The round-robin
policy would use that path equally with faster paths, which
results in sub-optimal performance.

The queue-depth policy instead sends I/O requests down the path
with the fewest requests in its request queue. Paths with lower
latency will clear requests more quickly and have fewer requests
in their queues compared to "bad" paths. The aim is to use those
paths the most and so bring down overall latency.

This implementation adds an atomic variable to the nvme_ctrl
struct to represent the queue depth. It is updated each time a
request specific to that controller starts or ends.

[edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
and compilation warnings, updated MODULE_PARM description, and
fixed potential issue with ->current_path[] being used]

Tested-by: John Meneghini <[email protected]>
Co-developed-by: Thomas Song <[email protected]>
Signed-off-by: Thomas Song <[email protected]>
Signed-off-by: Ewan D. Milne <[email protected]>
---
drivers/nvme/host/multipath.c | 59 +++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 2 ++
2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..9e36002d0831 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
};

static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_NUMA;
else if (!strncmp(val, "round-robin", 11))
iopolicy = NVME_IOPOLICY_RR;
+ else if (!strncmp(val, "queue-depth", 11))
+ iopolicy = NVME_IOPOLICY_QD;
else
return -EINVAL;

@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
&iopolicy, 0644);
MODULE_PARM_DESC(iopolicy,
- "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+ "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");

void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
{
@@ -130,6 +133,7 @@ void nvme_mpath_start_request(struct request *rq)
if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
return;

+ atomic_inc(&ns->ctrl->nr_active);
nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
jiffies);
@@ -142,6 +146,8 @@ void nvme_mpath_end_request(struct request *rq)

if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
+
+ atomic_dec(&ns->ctrl->nr_active);
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
blk_rq_bytes(rq) >> SECTOR_SHIFT,
nvme_req(rq)->start_time);
@@ -330,6 +336,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
return found;
}

+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+ unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+ unsigned int depth;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ if (nvme_path_is_disabled(ns))
+ continue;
+
+ depth = atomic_read(&ns->ctrl->nr_active);
+
+ switch (ns->ana_state) {
+ case NVME_ANA_OPTIMIZED:
+ if (depth < min_depth_opt) {
+ min_depth_opt = depth;
+ best_opt = ns;
+ }
+ break;
+
+ case NVME_ANA_NONOPTIMIZED:
+ if (depth < min_depth_nonopt) {
+ min_depth_nonopt = depth;
+ best_nonopt = ns;
+ }
+ break;
+ default:
+ break;
+ }
+ }
+
+ return best_opt ? best_opt : best_nonopt;
+}
+
static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
{
return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
@@ -338,15 +378,27 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)

inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
- int node = numa_node_id();
+ int iopolicy = READ_ONCE(head->subsys->iopolicy);
+ int node;
struct nvme_ns *ns;

+ /*
+ * queue-depth iopolicy does not need to reference ->current_path
+ * but round-robin needs the last path used to advance to the
+ * next one, and numa will continue to use the last path unless
+ * it is or has become not optimized
+ */
+ if (iopolicy == NVME_IOPOLICY_QD)
+ return nvme_queue_depth_path(head);
+
+ node = numa_node_id();
ns = srcu_dereference(head->current_path[node], &head->srcu);
if (unlikely(!ns))
return __nvme_find_path(head, node);

- if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
+ if (iopolicy == NVME_IOPOLICY_RR)
return nvme_round_robin_path(head, node, ns);
+
if (unlikely(!nvme_path_is_optimized(ns)))
return __nvme_find_path(head, node);
return ns;
@@ -905,6 +957,7 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
mutex_init(&ctrl->ana_lock);
timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
INIT_WORK(&ctrl->ana_work, nvme_ana_work);
+ atomic_set(&ctrl->nr_active, 0);
}

int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f243a5822c2b..e7d0a56d35d4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -354,6 +354,7 @@ struct nvme_ctrl {
size_t ana_log_size;
struct timer_list anatt_timer;
struct work_struct ana_work;
+ atomic_t nr_active;
#endif

#ifdef CONFIG_NVME_HOST_AUTH
@@ -402,6 +403,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
+ NVME_IOPOLICY_QD,
};

struct nvme_subsystem {
--
2.39.3


2024-05-20 14:49:18

by Keith Busch

Subject: Re: [PATCH v4 1/6] nvme: multipath: Implemented new iopolicy "queue-depth"

On Tue, May 14, 2024 at 01:53:17PM -0400, John Meneghini wrote:
> @@ -130,6 +133,7 @@ void nvme_mpath_start_request(struct request *rq)
> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> return;
>
> + atomic_inc(&ns->ctrl->nr_active);

Why skip passthrough and stats?

And I think you should squash the follow-up patch that constrains the
atomics to the queue-depth path selector.

> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> + unsigned int depth;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + if (nvme_path_is_disabled(ns))
> + continue;
> +
> + depth = atomic_read(&ns->ctrl->nr_active);
> +
> + switch (ns->ana_state) {
> + case NVME_ANA_OPTIMIZED:
> + if (depth < min_depth_opt) {
> + min_depth_opt = depth;
> + best_opt = ns;
> + }
> + break;
> +
> + case NVME_ANA_NONOPTIMIZED:
> + if (depth < min_depth_nonopt) {
> + min_depth_nonopt = depth;
> + best_nonopt = ns;
> + }
> + break;
> + default:
> + break;
> + }
> + }
> +

I think you can do the atomic_inc here so you don't have to check the io
policy a 2nd time.

> + return best_opt ? best_opt : best_nonopt;
> +}