2020-11-12 14:10:34

by Rachit Agarwal

Subject: [PATCH] iosched: Add i10 I/O Scheduler

From: Rachit Agarwal <[email protected]>

Hi All,

I/O batching is beneficial for optimizing IOPS and throughput for various
applications. For instance, several kernel block drivers would benefit from batching,
including mmc [1] and tcp-based storage drivers like nvme-tcp [2,3]. While we have
support for batching dispatch [4], we need an I/O scheduler to efficiently enable
batching. Such a scheduler is particularly interesting for disaggregated storage,
where the access latency of remote disaggregated storage may be higher than local
storage access; thus, batching can significantly help in amortizing the remote access
latency while increasing the throughput.

This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
of #requests, #bytes, and timeouts (at microsecond granularity). i10 starts
dispatching only when #requests or #bytes exceeds a default threshold or when
a timer expires. After that, batching dispatch [4] happens, allowing batching
at device drivers along with "bd->last" and ".commit_rqs".
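
For intuition, the per-hctx gating decision in the patch below boils down to the
following check (a condensed user-space sketch; the struct and function names here
are simplified stand-ins for i10_hctx_dispatch_now() and friends):

#include <stdbool.h>

struct i10_batch_sketch {
	unsigned int qlen_nr;		/* requests queued in this hctx */
	unsigned int qlen_bytes;	/* write bytes queued in this hctx */
	bool timer_expired;		/* the per-hctx hrtimer fired */
};

/* Hold requests back until one threshold trips, then drain the whole batch. */
static bool i10_should_dispatch(const struct i10_batch_sketch *s,
				unsigned int batch_nr,	  /* default 16 */
				unsigned int batch_bytes) /* default 64KB */
{
	return s->qlen_nr >= batch_nr ||
	       s->qlen_bytes >= batch_bytes ||
	       s->timer_expired;	/* default 50us */
}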

The i10 I/O scheduler builds upon the recent work in [6]. We have tested the i10 I/O
scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying the
number of cores, the read/write ratio, and the request size, with both an NVMe SSD
and a RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvement
in IOPS per core over the "noop" I/O scheduler. These results are available at [5],
and many additional results are presented in [6].

While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
nor a need for a global tracking context, so a new scheduler is needed rather than
building this functionality into an existing scheduler.

We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
tests in [6]. For our future work, we plan to support adaptive batching according to
system load and to extend the scheduler to support isolation in multi-tenant deployments
(to simultaneously achieve low tail latency for latency-sensitive applications and high
throughput for throughput-bound applications).
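
For completeness, with the standard elevator sysfs layout the scheduler can be
selected and the defaults overridden at runtime, e.g. (the device name and values
below are just placeholders):

  echo i10 > /sys/block/nvme0n1/queue/scheduler
  echo 32 > /sys/block/nvme0n1/queue/iosched/batch_nr
  echo 131072 > /sys/block/nvme0n1/queue/iosched/batch_bytes
  echo 100 > /sys/block/nvme0n1/queue/iosched/batch_timeout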

References
[1] https://lore.kernel.org/linux-block/[email protected]/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
[4] https://lore.kernel.org/linux-block/[email protected]/
[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
[6] https://www.usenix.org/conference/nsdi20/presentation/hwang

Signed-off-by: Jaehyun Hwang <[email protected]>
Signed-off-by: Qizhe Cai <[email protected]>
Signed-off-by: Midhul Vuppalapati <[email protected]>
Signed-off-by: Rachit Agarwal <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
---
Documentation/block/i10-iosched.rst | 69 +++++
block/Kconfig.iosched | 8 +
block/Makefile | 1 +
block/i10-iosched.c | 421 ++++++++++++++++++++++++++++
4 files changed, 499 insertions(+)
create mode 100644 Documentation/block/i10-iosched.rst
create mode 100644 block/i10-iosched.c

diff --git a/Documentation/block/i10-iosched.rst b/Documentation/block/i10-iosched.rst
new file mode 100644
index 000000000000..3db7ca3ed4c1
--- /dev/null
+++ b/Documentation/block/i10-iosched.rst
@@ -0,0 +1,69 @@
+==========================
+i10 I/O scheduler overview
+==========================
+
+I/O batching is beneficial for optimizing IOPS and throughput for various
+applications. For instance, several kernel block drivers would
+benefit from batching, including mmc [1] and tcp-based storage drivers like
+nvme-tcp [2,3]. While we have support for batching dispatch [4], we need an
+I/O scheduler to efficiently enable batching. Such a scheduler is particularly
+interesting for disaggregated storage, where the access latency of remote
+disaggregated storage may be higher than local storage access; thus, batching
+can significantly help in amortizing the remote access latency while increasing
+the throughput.
+
+This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
+of #requests, #bytes, and timeouts (at microsecond granularity). i10 starts
+dispatching only when #requests or #bytes exceeds a default threshold or when
+a timer expires. After that, batching dispatch [4] happens, allowing batching
+at device drivers along with "bd->last" and ".commit_rqs".
+
+The i10 I/O scheduler builds upon the recent work in [6]. We have tested the i10 I/O
+scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying the
+number of cores, the read/write ratio, and the request size, with both an NVMe SSD
+and a RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvement
+in IOPS per core over the "noop" I/O scheduler. These results are available at [5],
+and many additional results are presented in [6].
+
+While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
+in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
+nor a need for a global tracking context, so a new scheduler is needed rather than
+building this functionality into an existing scheduler.
+
+We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
+64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
+tests in [6]. For our future work, we plan to support adaptive batching according to
+system load and to extend the scheduler to support isolation in multi-tenant deployments
+(to simultaneously achieve low tail latency for latency-sensitive applications and high
+throughput for throughput-bound applications).
+
+References
+[1] https://lore.kernel.org/linux-block/[email protected]/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
+[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
+[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
+[4] https://lore.kernel.org/linux-block/[email protected]/
+[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
+[6] https://www.usenix.org/conference/nsdi20/presentation/hwang
+
+==========================
+i10 I/O scheduler tunables
+==========================
+
+The three tunables for the i10 scheduler are the number of requests for
+reads/writes, the number of bytes for writes, and a timeout value.
+i10 will use these values for batching requests.
+
+batch_nr
+--------
+Number of read/write requests to accumulate before dispatching a batch
+Default: 16
+
+batch_bytes
+-----------
+Number of bytes for batching write requests
+Default: 65536 (bytes)
+
+batch_timeout
+-------------
+Timeout value for batching (in microseconds)
+Default: 50 (us)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..5b3623b19487 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -44,6 +44,14 @@ config BFQ_CGROUP_DEBUG
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config MQ_IOSCHED_I10
+ tristate "i10 I/O scheduler"
+ default y
+ help
+ The i10 I/O scheduler supports batching at the blk-mq layer.
+ Any device driver that benefits from batching
+ (e.g., NVMe-over-TCP) can use this scheduler.
+
endmenu

endif
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..27e0789589ea 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_I10) += i10-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
obj-$(CONFIG_IOSCHED_BFQ) += bfq.o

diff --git a/block/i10-iosched.c b/block/i10-iosched.c
new file mode 100644
index 000000000000..b5451beaa66d
--- /dev/null
+++ b/block/i10-iosched.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The i10 I/O Scheduler - supports batching at blk-mq.
+ * The main use case is disaggregated storage access
+ * using NVMe-over-Fabric (e.g., NVMe-over-TCP device driver).
+ *
+ * An early version of the idea is described and evaluated in
+ * "TCP ≈ RDMA: CPU-efficient Remote Storage Access with i10",
+ * USENIX NSDI 2020.
+ *
+ * Copyright (C) 2020 Cornell University
+ * Jaehyun Hwang <[email protected]>
+ * Qizhe Cai <[email protected]>
+ * Midhul Vuppalapati <[email protected]>
+ * Rachit Agarwal <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-debugfs.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+
+/* Default batch size in number of requests */
+#define I10_DEF_BATCH_NR 16
+/* Default batch size in bytes (for write requests) */
+#define I10_DEF_BATCH_BYTES 65536
+/* Default timeout value for batching (us units) */
+#define I10_DEF_BATCH_TIMEOUT 50
+
+enum i10_state {
+ /* Batching state:
+ * Do not run dispatching until we have
+ * a certain amount of requests or a timer expires.
+ */
+ I10_STATE_BATCH,
+
+ /* Dispatching state:
+ * Run dispatching until all requests in the
+ * scheduler's per-hctx queue (ihq) are dispatched.
+ */
+ I10_STATE_DISPATCH,
+};
+
+struct i10_queue_data {
+ struct request_queue *q;
+
+ unsigned int def_batch_nr;
+ unsigned int def_batch_bytes;
+ unsigned int def_batch_timeout;
+};
+
+struct i10_hctx_queue {
+ spinlock_t lock;
+ struct list_head rq_list;
+
+ struct blk_mq_hw_ctx *hctx;
+
+ unsigned int batch_nr;
+ unsigned int batch_bytes;
+ unsigned int batch_timeout;
+
+ unsigned int qlen_nr;
+ unsigned int qlen_bytes;
+
+ struct hrtimer dispatch_timer;
+ enum i10_state state;
+};
+
+static struct i10_queue_data *i10_queue_data_alloc(struct request_queue *q)
+{
+ struct i10_queue_data *iqd;
+
+ iqd = kzalloc_node(sizeof(*iqd), GFP_KERNEL, q->node);
+ if (!iqd)
+ return ERR_PTR(-ENOMEM);
+
+ iqd->q = q;
+ iqd->def_batch_nr = I10_DEF_BATCH_NR;
+ iqd->def_batch_bytes = I10_DEF_BATCH_BYTES;
+ iqd->def_batch_timeout = I10_DEF_BATCH_TIMEOUT;
+
+ return iqd;
+}
+
+static int i10_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+ struct i10_queue_data *iqd;
+ struct elevator_queue *eq;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ iqd = i10_queue_data_alloc(q);
+ if (IS_ERR(iqd)) {
+ kobject_put(&eq->kobj);
+ return PTR_ERR(iqd);
+ }
+
+ blk_stat_enable_accounting(q);
+
+ eq->elevator_data = iqd;
+ q->elevator = eq;
+
+ return 0;
+}
+
+static void i10_exit_sched(struct elevator_queue *e)
+{
+ struct i10_queue_data *iqd = e->elevator_data;
+
+ kfree(iqd);
+}
+
+static enum hrtimer_restart i10_hctx_timeout_handler(struct hrtimer *timer)
+{
+ struct i10_hctx_queue *ihq =
+ container_of(timer, struct i10_hctx_queue,
+ dispatch_timer);
+
+ ihq->state = I10_STATE_DISPATCH;
+ blk_mq_run_hw_queue(ihq->hctx, true);
+
+ return HRTIMER_NORESTART;
+}
+
+static void i10_hctx_queue_reset(struct i10_hctx_queue *ihq)
+{
+ ihq->qlen_nr = 0;
+ ihq->qlen_bytes = 0;
+ ihq->state = I10_STATE_BATCH;
+}
+
+static int i10_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ struct i10_hctx_queue *ihq;
+
+ ihq = kzalloc_node(sizeof(*ihq), GFP_KERNEL, hctx->numa_node);
+ if (!ihq)
+ return -ENOMEM;
+
+ spin_lock_init(&ihq->lock);
+ INIT_LIST_HEAD(&ihq->rq_list);
+
+ ihq->hctx = hctx;
+ ihq->batch_nr = 0;
+ ihq->batch_bytes = 0;
+ ihq->batch_timeout = 0;
+
+ hrtimer_init(&ihq->dispatch_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ ihq->dispatch_timer.function = &i10_hctx_timeout_handler;
+
+ i10_hctx_queue_reset(ihq);
+
+ hctx->sched_data = ihq;
+
+ return 0;
+}
+
+static void i10_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ hrtimer_cancel(&ihq->dispatch_timer);
+ kfree(hctx->sched_data);
+}
+
+static bool i10_hctx_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ unsigned int nr_segs)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct list_head *rq_list = &ihq->rq_list;
+ bool merged;
+
+ spin_lock(&ihq->lock);
+ merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
+ spin_unlock(&ihq->lock);
+
+ if (merged && bio_data_dir(bio) == WRITE)
+ ihq->qlen_bytes += bio->bi_iter.bi_size;
+
+ return merged;
+}
+
+/*
+ * The batch size can be adjusted dynamically on a per-hctx basis.
+ * Use per-hctx variables in that case.
+ */
+static inline unsigned int i10_hctx_batch_nr(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_nr ?
+ ihq->batch_nr : iqd->def_batch_nr;
+}
+
+static inline unsigned int i10_hctx_batch_bytes(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_bytes ?
+ ihq->batch_bytes : iqd->def_batch_bytes;
+}
+
+static inline unsigned int i10_hctx_batch_timeout(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_timeout ?
+ ihq->batch_timeout : iqd->def_batch_timeout;
+}
+
+static void i10_hctx_insert_update(struct i10_hctx_queue *ihq,
+ struct request *rq)
+{
+ if (rq_data_dir(rq) == WRITE)
+ ihq->qlen_bytes += blk_rq_bytes(rq);
+ ihq->qlen_nr++;
+}
+
+static void i10_hctx_insert_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *rq_list, bool at_head)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct request *rq, *next;
+
+ list_for_each_entry_safe(rq, next, rq_list, queuelist) {
+ struct list_head *head = &ihq->rq_list;
+
+ spin_lock(&ihq->lock);
+ if (at_head)
+ list_move(&rq->queuelist, head);
+ else
+ list_move_tail(&rq->queuelist, head);
+ i10_hctx_insert_update(ihq, rq);
+ blk_mq_sched_request_inserted(rq);
+ spin_unlock(&ihq->lock);
+ }
+
+ /* Start a new timer */
+ if (ihq->state == I10_STATE_BATCH &&
+ !hrtimer_active(&ihq->dispatch_timer))
+ hrtimer_start(&ihq->dispatch_timer,
+ ns_to_ktime(i10_hctx_batch_timeout(hctx)
+ * NSEC_PER_USEC),
+ HRTIMER_MODE_REL);
+}
+
+static struct request *i10_hctx_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct request *rq;
+
+ spin_lock(&ihq->lock);
+ rq = list_first_entry_or_null(&ihq->rq_list,
+ struct request, queuelist);
+ if (rq)
+ list_del_init(&rq->queuelist);
+ else
+ i10_hctx_queue_reset(ihq);
+ spin_unlock(&ihq->lock);
+
+ return rq;
+}
+
+static inline bool i10_hctx_dispatch_now(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return (ihq->qlen_nr >= i10_hctx_batch_nr(hctx)) ||
+ (ihq->qlen_bytes >= i10_hctx_batch_bytes(hctx));
+}
+
+/*
+ * Return true if we are in the dispatching state.
+ */
+static bool i10_hctx_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ if (ihq->state == I10_STATE_BATCH) {
+ if (i10_hctx_dispatch_now(hctx)) {
+ ihq->state = I10_STATE_DISPATCH;
+ if (hrtimer_active(&ihq->dispatch_timer))
+ hrtimer_cancel(&ihq->dispatch_timer);
+ }
+ }
+
+ return (ihq->state == I10_STATE_DISPATCH);
+}
+
+#define I10_DEF_BATCH_SHOW_STORE(name) \
+static ssize_t i10_def_batch_##name##_show(struct elevator_queue *e, \
+ char *page) \
+{ \
+ struct i10_queue_data *iqd = e->elevator_data; \
+ \
+ return sprintf(page, "%u\n", iqd->def_batch_##name); \
+} \
+ \
+static ssize_t i10_def_batch_##name##_store(struct elevator_queue *e, \
+ const char *page, size_t count) \
+{ \
+ struct i10_queue_data *iqd = e->elevator_data; \
+ unsigned long long value; \
+ int ret; \
+ \
+ ret = kstrtoull(page, 10, &value); \
+ if (ret) \
+ return ret; \
+ \
+ iqd->def_batch_##name = value; \
+ \
+ return count; \
+}
+I10_DEF_BATCH_SHOW_STORE(nr);
+I10_DEF_BATCH_SHOW_STORE(bytes);
+I10_DEF_BATCH_SHOW_STORE(timeout);
+#undef I10_DEF_BATCH_SHOW_STORE
+
+#define I10_SCHED_ATTR(name) \
+ __ATTR(batch_##name, 0644, i10_def_batch_##name##_show, i10_def_batch_##name##_store)
+static struct elv_fs_entry i10_sched_attrs[] = {
+ I10_SCHED_ATTR(nr),
+ I10_SCHED_ATTR(bytes),
+ I10_SCHED_ATTR(timeout),
+ __ATTR_NULL
+};
+#undef I10_SCHED_ATTR
+
+#ifdef CONFIG_BLK_DEBUG_FS
+#define I10_DEBUGFS_SHOW(name) \
+static int i10_hctx_batch_##name##_show(void *data, struct seq_file *m) \
+{ \
+ struct blk_mq_hw_ctx *hctx = data; \
+ struct i10_hctx_queue *ihq = hctx->sched_data; \
+ \
+ seq_printf(m, "%u\n", ihq->batch_##name); \
+ return 0; \
+} \
+ \
+static int i10_hctx_qlen_##name##_show(void *data, struct seq_file *m) \
+{ \
+ struct blk_mq_hw_ctx *hctx = data; \
+ struct i10_hctx_queue *ihq = hctx->sched_data; \
+ \
+ seq_printf(m, "%u\n", ihq->qlen_##name); \
+ return 0; \
+}
+I10_DEBUGFS_SHOW(nr);
+I10_DEBUGFS_SHOW(bytes);
+#undef I10_DEBUGFS_SHOW
+
+static int i10_hctx_state_show(void *data, struct seq_file *m)
+{
+ struct blk_mq_hw_ctx *hctx = data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ seq_printf(m, "%d\n", ihq->state);
+ return 0;
+}
+
+#define I10_HCTX_QUEUE_ATTR(name) \
+ {"batch_" #name, 0400, i10_hctx_batch_##name##_show}, \
+ {"qlen_" #name, 0400, i10_hctx_qlen_##name##_show}
+static const struct blk_mq_debugfs_attr i10_hctx_debugfs_attrs[] = {
+ I10_HCTX_QUEUE_ATTR(nr),
+ I10_HCTX_QUEUE_ATTR(bytes),
+ {"state", 0400, i10_hctx_state_show},
+ {},
+};
+#undef I10_HCTX_QUEUE_ATTR
+#endif
+
+static struct elevator_type i10_sched = {
+ .ops = {
+ .init_sched = i10_init_sched,
+ .exit_sched = i10_exit_sched,
+ .init_hctx = i10_init_hctx,
+ .exit_hctx = i10_exit_hctx,
+ .bio_merge = i10_hctx_bio_merge,
+ .insert_requests = i10_hctx_insert_requests,
+ .dispatch_request = i10_hctx_dispatch_request,
+ .has_work = i10_hctx_has_work,
+ },
+#ifdef CONFIG_BLK_DEBUG_FS
+ .hctx_debugfs_attrs = i10_hctx_debugfs_attrs,
+#endif
+ .elevator_attrs = i10_sched_attrs,
+ .elevator_name = "i10",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init i10_init(void)
+{
+ return elv_register(&i10_sched);
+}
+
+static void __exit i10_exit(void)
+{
+ elv_unregister(&i10_sched);
+}
+
+module_init(i10_init);
+module_exit(i10_exit);
+
+MODULE_AUTHOR("Jaehyun Hwang, Qizhe Cai, Midhul Vuppalapati, Rachit Agarwal");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("i10 I/O scheduler");
--
2.22.0


2020-11-12 18:07:09

by Jens Axboe

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On 11/12/20 7:07 AM, Rachit Agarwal wrote:
> From: Rachit Agarwal <[email protected]>>
>
> Hi All,
>
> I/O batching is beneficial for optimizing IOPS and throughput for
> various applications. For instance, several kernel block drivers would
> benefit from batching, including mmc [1] and tcp-based storage drivers
> like nvme-tcp [2,3]. While we have support for batching dispatch [4],
> we need an I/O scheduler to efficiently enable batching. Such a
> scheduler is particularly interesting for disaggregated storage, where
> the access latency of remote disaggregated storage may be higher than
> local storage access; thus, batching can significantly help in
> amortizing the remote access latency while increasing the throughput.
>
> This patch introduces the i10 I/O scheduler, which performs batching
> per hctx in terms of #requests, #bytes, and timeouts (at microseconds
> granularity). i10 starts dispatching only when #requests or #bytes is
> larger than a default threshold or when a timer expires. After that,
> batching dispatch [3] would happen, allowing batching at device
> drivers along with "bd->last" and ".commit_rqs".
>
> The i10 I/O scheduler builds upon recent work on [6]. We have tested
> the i10 I/O scheduler with nvme-tcp optimizations [2,3] and batching
> dispatch [4], varying number of cores, varying read/write ratios, and
> varying request sizes, and with NVMe SSD and RAM block device. For
> NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in terms
> of IOPS per core over "noop" I/O scheduler. These results are
> available at [5], and many additional results are presented in [6].
>
> While other schedulers may also batch I/O (e.g., mq-deadline), the
> optimization target in the i10 I/O scheduler is throughput
> maximization. Hence there is no latency target nor a need for a global
> tracking context, so a new scheduler is needed rather than to build
> this functionality to an existing scheduler.
>
> We currently use fixed default values as batching thresholds (e.g., 16
> for #requests, 64KB for #bytes, and 50us for timeout). These default
> values are based on sensitivity tests in [6]. For our future work, we
> plan to support adaptive batching according to system load and to
> extend the scheduler to support isolation in multi-tenant deployments
> (to simultaneously achieve low tail latency for latency-sensitive
> applications and high throughput for throughput-bound applications).

I haven't taken a close look at the code yet so far, but one quick note
that patches like this should be against the branches for 5.11. In fact,
this one doesn't even compile against current -git, as
blk_mq_bio_list_merge is now called blk_bio_list_merge.

In any case, I did run this through some quick peak testing as I was
curious, and I'm seeing about 20% drop in peak IOPS over none running
this. Perf diff:

10.71% -2.44% [kernel.vmlinux] [k] read_tsc
2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock

Also:

> [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf

Was curious and wanted to look it up, but it doesn't exist.

--
Jens Axboe

2020-11-13 15:02:27

by Ming Lei

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

Hello,

On Thu, Nov 12, 2020 at 09:07:52AM -0500, Rachit Agarwal wrote:
> From: Rachit Agarwal <[email protected]>>
>
> Hi All,
>
> I/O batching is beneficial for optimizing IOPS and throughput for various
> applications. For instance, several kernel block drivers would benefit from batching,
> including mmc [1] and tcp-based storage drivers like nvme-tcp [2,3]. While we have
> support for batching dispatch [4], we need an I/O scheduler to efficiently enable
> batching. Such a scheduler is particularly interesting for disaggregated storage,
> where the access latency of remote disaggregated storage may be higher than local
> storage access; thus, batching can significantly help in amortizing the remote access
> latency while increasing the throughput.
>
> This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
> of #requests, #bytes, and timeouts (at microseconds granularity). i10 starts
> dispatching only when #requests or #bytes is larger than a default threshold or when
> a timer expires. After that, batching dispatch [3] would happen, allowing batching
> at device drivers along with "bd->last" and ".commit_rqs".

blk-mq actually has a built-in batching (or sort of) mechanism, which is enabled
when the hw queue is busy (hctx->dispatch_busy > 0). We use an EWMA to compute
hctx->dispatch_busy, and it is adaptive, even though the implementation is quite
coarse. But there should be much room to improve, IMO.
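
For reference, the EWMA update is roughly the following (a user-space sketch from
memory of blk_mq_update_dispatch_busy(); the exact weight/factor constants in
mainline may differ):

#include <stdbool.h>

#define DISPATCH_BUSY_EWMA_WEIGHT	8
#define DISPATCH_BUSY_EWMA_FACTOR	4

/* Called after each dispatch attempt; a non-zero result marks the hctx busy. */
static unsigned int update_dispatch_busy(unsigned int ewma, bool busy)
{
	if (!ewma && !busy)
		return 0;

	ewma *= DISPATCH_BUSY_EWMA_WEIGHT - 1;
	if (busy)
		ewma += 1 << DISPATCH_BUSY_EWMA_FACTOR;
	ewma /= DISPATCH_BUSY_EWMA_WEIGHT;

	return ewma;
}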

It is reported that this approach improves SQ (single-queue) high-end SCSI SSDs
very much [1], and MMC performance gets improved too [2].

[1] https://lore.kernel.org/linux-block/[email protected]/
[2] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

>
> The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O
> scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying number
> of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and
> RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
> terms of IOPS per core over "noop" I/O scheduler. These results are available at [5],
> and many additional results are presented in [6].

With the none scheduler, the nvme driver basically won't provide any queue-busy
feedback, so the built-in batching dispatch simply doesn't work.

The kyber scheduler uses I/O latency feedback to throttle and build I/O batches;
can you compare i10 with kyber on nvme/nvme-tcp?

>
> While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
> in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
> nor a need for a global tracking context, so a new scheduler is needed rather than
> to build this functionality to an existing scheduler.
>
> We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
> 64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
> tests in [6]. For our future work, we plan to support adaptive batching according to

Frankly speaking, hardcoded values of 16 #requests or 64KB may not work everywhere,
and production environments could be much more complicated than your sensitivity
tests. If possible, please start with adaptive batching.

A 50us timeout can contribute to I/O latency, and I'd like to see I/O latency
data with i10, especially i10 vs. vanilla none.


Thanks,
Ming

2020-11-13 20:36:52

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


> I haven't taken a close look at the code yet so far, but one quick note
> that patches like this should be against the branches for 5.11. In fact,
> this one doesn't even compile against current -git, as
> blk_mq_bio_list_merge is now called blk_bio_list_merge.

Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase
before submitting.. Sorry about that.

> In any case, I did run this through some quick peak testing as I was
> curious, and I'm seeing about 20% drop in peak IOPS over none running
> this. Perf diff:
>
> 10.71% -2.44% [kernel.vmlinux] [k] read_tsc
> 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock

You ran this with nvme? or null_blk? I guess neither would benefit
from this because if the underlying device does not benefit from
batching (at least enough to cover the extra cost of accounting for it) it
will be counterproductive to use this scheduler.

> Also:
>
>> [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
>
> Was curious and wanted to look it up, but it doesn't exist.

I think this is the right one:
https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf

We had some back and forth around the naming, hence this was probably
omitted.

2020-11-13 21:02:06

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


> blk-mq actually has built-in batching(or sort of) mechanism, which is enabled
> if the hw queue is busy(hctx->dispatch_busy is > 0). We use EWMA to compute
> hctx->dispatch_busy, and it is adaptive, even though the implementation is quite
> coarse. But there should be much space to improve, IMO.

You are correct, however nvme-tcp should be getting to dispatch_busy > 0
IIUC.

> It is reported that this way improves SQ high-end SCSI SSD very much[1],
> and MMC performance gets improved too[2].
>
> [1] https://lore.kernel.org/linux-block/[email protected]/
> [2] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

Yes, the guys paid attention to the MMC related improvements that you
made.

>> The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O
>> scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying number
>> of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and
>> RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
>> terms of IOPS per core over "noop" I/O scheduler. These results are available at [5],
>> and many additional results are presented in [6].
>
> In case of none scheduler, basically nvme driver won't provide any queue busy
> feedback, so the built-in batching dispatch doesn't work simply.

Exactly.

> kyber scheduler uses io latency feedback to throttle and build io batch,
> can you compare i10 with kyber on nvme/nvme-tcp?

I assume it should be simple to get, I'll let Rachit/Jaehyun comment.

>> While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
>> in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
>> nor a need for a global tracking context, so a new scheduler is needed rather than
>> to build this functionality to an existing scheduler.
>>
>> We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
>> 64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
>> tests in [6]. For our future work, we plan to support adaptive batching according to
>
> Frankly speaking, hardcode 16 #rquests or 64KB may not work everywhere,
> and product environment could be much complicated than your sensitivity
> tests. If possible, please start with adaptive batching.

That was my feedback as well for sure. But given that this is a
scheduler one would opt-in to anyway, that won't be a must-have
initially. I'm not sure if the guys made progress with this yet, I'll
let them comment.

2020-11-13 21:05:18

by Jens Axboe

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On 11/13/20 1:34 PM, Sagi Grimberg wrote:
>
>> I haven't taken a close look at the code yet so far, but one quick note
>> that patches like this should be against the branches for 5.11. In fact,
>> this one doesn't even compile against current -git, as
>> blk_mq_bio_list_merge is now called blk_bio_list_merge.
>
> Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase
> before submitting.. Sorry about that.
>
>> In any case, I did run this through some quick peak testing as I was
>> curious, and I'm seeing about 20% drop in peak IOPS over none running
>> this. Perf diff:
>>
>> 10.71% -2.44% [kernel.vmlinux] [k] read_tsc
>> 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock
>
> You ran this with nvme? or null_blk? I guess neither would benefit
> from this because if the underlying device will not benefit from
> batching (at least enough for the extra cost of accounting for it) it
> will be counter productive to use this scheduler.

This is nvme, actual device. The initial posting could be a bit more
explicit on the use case, it says:

"For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
terms of IOPS per core over "noop" I/O scheduler."

which made me very skeptical, as it sounds like a raw device claim.

Does beg the question of why this is a new scheduler then. It's pretty
basic stuff, something that could trivially just be added as a side effect
of the core (and in fact we have much of it already). Doesn't really seem
to warrant a new scheduler at all. There isn't really much in there.

>>> [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
>>
>> Was curious and wanted to look it up, but it doesn't exist.
>
> I think this is the right one:
> https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf
>
> We had some back and forth around the naming, hence this was probably
> omitted.

That works, my local results were a bit worse than listed in there though.
And what does this mean:

"We note that Linux I/O scheduler introduces an additional kernel worker
thread at the I/O dispatching stage"

It most certainly does not for the common/hot case.

--
Jens Axboe

2020-11-13 21:33:52

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


>>> I haven't taken a close look at the code yet so far, but one quick note
>>> that patches like this should be against the branches for 5.11. In fact,
>>> this one doesn't even compile against current -git, as
>>> blk_mq_bio_list_merge is now called blk_bio_list_merge.
>>
>> Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase
>> before submitting.. Sorry about that.
>>
>>> In any case, I did run this through some quick peak testing as I was
>>> curious, and I'm seeing about 20% drop in peak IOPS over none running
>>> this. Perf diff:
>>>
>>> 10.71% -2.44% [kernel.vmlinux] [k] read_tsc
>>> 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock
>>
>> You ran this with nvme? or null_blk? I guess neither would benefit
>> from this because if the underlying device will not benefit from
>> batching (at least enough for the extra cost of accounting for it) it
>> will be counter productive to use this scheduler.
>
> This is nvme, actual device. The initial posting could be a bit more
> explicit on the use case, it says:
>
> "For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
> terms of IOPS per core over "noop" I/O scheduler."
>
> which made me very skeptical, as it sounds like it's raw device claims.

You are absolutely right, that needs to be fixed.

> Does beg the question of why this is a new scheduler then. It's pretty
> basic stuff, something that could trivially just be added a side effect
> of the core (and in fact we have much of it already). Doesn't really seem
> to warrant a new scheduler at all. There isn't really much in there.

Not saying it absolutely warrants a new one, and it could I guess sit in
the core, but this attempts to optimize for a specific metric while
trading off others, which is exactly what I/O schedulers are for:
optimizing for a specific metric.

Not sure we want to build something biased towards throughput at the
expense of latency into the block core. And, as mentioned, this is not
well suited to all device types...

But if you think this has a better home, I'm assuming that the guys
will be open to that.

>>> Was curious and wanted to look it up, but it doesn't exist.
>>
>> I think this is the right one:
>> https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf
>>
>> We had some back and forth around the naming, hence this was probably
>> omitted.
>
> That works, my local results were a bit worse than listed in there though.
> And what does this mean:
>
> "We note that Linux I/O scheduler introduces an additional kernel worker
> thread at the I/O dispatching stage"
>
> It most certainly does not for the common/hot case.

Yes I agree, didn't see the local results. Probably some
misunderstanding or a typo, I'll let them reply on this.

2020-11-13 21:39:09

by Jens Axboe

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On 11/13/20 2:23 PM, Sagi Grimberg wrote:
>
>>>> I haven't taken a close look at the code yet so far, but one quick note
>>>> that patches like this should be against the branches for 5.11. In fact,
>>>> this one doesn't even compile against current -git, as
>>>> blk_mq_bio_list_merge is now called blk_bio_list_merge.
>>>
>>> Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase
>>> before submitting.. Sorry about that.
>>>
>>>> In any case, I did run this through some quick peak testing as I was
>>>> curious, and I'm seeing about 20% drop in peak IOPS over none running
>>>> this. Perf diff:
>>>>
>>>> 10.71% -2.44% [kernel.vmlinux] [k] read_tsc
>>>> 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock
>>>
>>> You ran this with nvme? or null_blk? I guess neither would benefit
>>> from this because if the underlying device will not benefit from
>>> batching (at least enough for the extra cost of accounting for it) it
>>> will be counter productive to use this scheduler.
>>
>> This is nvme, actual device. The initial posting could be a bit more
>> explicit on the use case, it says:
>>
>> "For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
>> terms of IOPS per core over "noop" I/O scheduler."
>>
>> which made me very skeptical, as it sounds like it's raw device claims.
>
> You are absolutely right, that needs to be fixed.
>
>> Does beg the question of why this is a new scheduler then. It's pretty
>> basic stuff, something that could trivially just be added a side effect
>> of the core (and in fact we have much of it already). Doesn't really seem
>> to warrant a new scheduler at all. There isn't really much in there.
>
> Not saying it absolutely warrants a new one, and it could I guess sit in
> the core, but this attempts to optimize for a specific metric while
> trading-off others, which is exactly what I/O schedulers are for,
> optimizing for a specific metric.
>
> Not sure we want to build something biases towards throughput on the
> expense of latency into the block core. And, as mentioned this is not
> well suited to all device types...
>
> But if you think this has a better home, I'm assuming that the guys
> will be open to that.

Also see the reply from Ming. It's a balancing act - don't want to add
extra overhead to the core, but also don't want to carry an extra
scheduler if the main change is really just variable dispatch batching.
And since we already have a notion of that, seems worthwhile to explore
that venue.

--
Jens Axboe

2020-11-13 21:40:06

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


>> But if you think this has a better home, I'm assuming that the guys
>> will be open to that.
>
> Also see the reply from Ming. It's a balancing act - don't want to add
> extra overhead to the core, but also don't want to carry an extra
> scheduler if the main change is really just variable dispatch batching.
> And since we already have a notion of that, seems worthwhile to explore
> that venue.

I agree,

The main difference is that this balancing is not driven from device
resource pressure, but rather from an assumption of device specific
optimization (and also with a specific optimization target), hence a
scheduler a user would need to opt-in seemed like a good compromise.

But maybe Ming has some good ideas on a different way to add it..

2020-11-13 21:46:40

by Jens Axboe

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On 11/13/20 2:36 PM, Sagi Grimberg wrote:
>
>>> But if you think this has a better home, I'm assuming that the guys
>>> will be open to that.
>>
>> Also see the reply from Ming. It's a balancing act - don't want to add
>> extra overhead to the core, but also don't want to carry an extra
>> scheduler if the main change is really just variable dispatch batching.
>> And since we already have a notion of that, seems worthwhile to explore
>> that venue.
>
> I agree,
>
> The main difference is that this balancing is not driven from device
> resource pressure, but rather from an assumption of device specific
> optimization (and also with a specific optimization target), hence a
> scheduler a user would need to opt-in seemed like a good compromise.
>
> But maybe Ming has some good ideas on a different way to add it..

So here's another case - virtualized nvme. The commit overhead is
sufficiently large there that performance suffers quite a bit, similarly to
your remote storage case. If we had suitable logic in the core, then we
could easily propagate this knowledge when setting up the queue. Then it
could happen automatically, without needing a configuration to switch to
a specific scheduler.
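
Purely as a strawman (none of the names below exist today), the "propagate this
knowledge when setting up the queue" idea could be as simple as a driver-declared
hint that core dispatch batching keys off:

/* Hypothetical sketch only: the struct, field, and values are invented here. */
struct dispatch_hint_sketch {
	unsigned int batch_nr;	/* 0 = dispatch immediately (today's behavior) */
};

/* A driver with expensive doorbells (nvme-tcp, virtualized nvme) opts in. */
static void expensive_doorbell_driver_setup(struct dispatch_hint_sketch *hint)
{
	hint->batch_nr = 16;
}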

--
Jens Axboe

2020-11-13 21:58:01

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


>>>> But if you think this has a better home, I'm assuming that the guys
>>>> will be open to that.
>>>
>>> Also see the reply from Ming. It's a balancing act - don't want to add
>>> extra overhead to the core, but also don't want to carry an extra
>>> scheduler if the main change is really just variable dispatch batching.
>>> And since we already have a notion of that, seems worthwhile to explore
>>> that venue.
>>
>> I agree,
>>
>> The main difference is that this balancing is not driven from device
>> resource pressure, but rather from an assumption of device specific
>> optimization (and also with a specific optimization target), hence a
>> scheduler a user would need to opt-in seemed like a good compromise.
>>
>> But maybe Ming has some good ideas on a different way to add it..
>
> So here's another case - virtualized nvme. The commit overhead is
> suitably large there that performance suffers quite a bit, similarly to
> your remote storage case. If we had suitable logic in the core, then we
> could easily propagate this knowledge when setting up the queue. Then it
> could happen automatically, without needing a configuration to switch to
> a specific scheduler.

Yes, these use-cases share characteristics. I'm not at all opposed to
placing this in the core. I do think that in order to put something like
this in the core, the bar needs to be higher such that an optimization
target cannot be biased towards a workload (i.e. needs to be adaptive).

I'm still not sure how we would build this on top of what we already
have as it is really centered around device being busy (which is not
the case for nvme), but I didn't put enough thought into it yet.

2020-11-16 08:45:29

by Ming Lei

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On Fri, Nov 13, 2020 at 01:36:16PM -0800, Sagi Grimberg wrote:
>
> > > But if you think this has a better home, I'm assuming that the guys
> > > will be open to that.
> >
> > Also see the reply from Ming. It's a balancing act - don't want to add
> > extra overhead to the core, but also don't want to carry an extra
> > scheduler if the main change is really just variable dispatch batching.
> > And since we already have a notion of that, seems worthwhile to explore
> > that venue.
>
> I agree,
>
> The main difference is that this balancing is not driven from device
> resource pressure, but rather from an assumption of device specific
> optimization (and also with a specific optimization target), hence a
> scheduler a user would need to opt-in seemed like a good compromise.
>
> But maybe Ming has some good ideas on a different way to add it..

Not yet, :-(

It is a very good piece of work showing that IO is improved with batching.

One big question I am still not clear on is how NVMe-TCP performance (which
should be throughput, according to the 'Introduction' part of the paper [1]) is
improved so much when I/O batching is applied. Is it because the network stack
performs much better when transporting big chunks of data? Or is context switch
overhead reduced, because 'ringing the doorbell' implies workqueue scheduling
according to '2.4 Delayed Doorbells' of [1]? Or both? Or something else? Do we
have data wrt. how much improvement comes from each factor?

Another question: the 'Introduction' part of [1] mentions that i10 is more for
'throughput-bound applications', and that 'at low loads, latencies may be high
(within 1.7x of NVMe-over-RDMA latency over storage devices)'. So is the i10
scheduler primarily for throughput-bound applications? If yes, I'd suggest adding
that to the commit log to help people review. Then we can avoid considering
I/O-latency-sensitive usages (such as iopoll).

[1] https://www.usenix.org/conference/nsdi20/presentation/hwang

Thanks,
Ming

2020-11-30 19:26:14

by Sagi Grimberg

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler


> Dear all:
>
> Thanks, again, for the very constructive discussions.
>
> I am writing back with quite a few updates:
>
> 1. We have now included a detailed comparison of i10 scheduler with
> Kyber with NVMe-over-TCP
> (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf).
> In a nutshell, when operating with NVMe-over-TCP, i10 demonstrates the
> core tradeoff: higher latency, but also higher throughput. This seems to
> be the core tradeoff exposed by i10.
>
> 2. We have now implemented an adaptive version of i10 I/O scheduler,
> that uses the number of outstanding requests at the time of batch
> dispatch (and whether the dispatch was triggered by timeout or not) to
> adaptively set the batching size. The new results
> (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf)
> show that i10-adaptive further improves performance for low loads, while
> keeping the performance for high loads. IMO, there is much to do on
> designing improved adaptation algorithms.
>
> 3. We have now updated the i10-evaluation document to include results
> for local storage access. The core take-away here is that i10-adaptive
> can achieve similar throughput and latency at low loads and at high
> loads when compared to noop, but still requires more work for lower
> loads. However, given that the tradeoff exposed by i10 scheduler is
> particularly useful for remote storage devices (and as Jens suggested,
> perhaps for virtualized local storage access), I do agree with Sagi -- I
> think we should consider including it in the core, since this may be
> useful for a broad range of new use cases.
>
> We have also created a second version of the patch that includes these
> updates:
> https://github.com/i10-kernel/upstream-linux/blob/master/0002-iosched-Add-i10-I-O-Scheduler.patch
>
> As always, thank you for the constructive discussion and I look forward
> to working with you on this.

Thanks Rachit,

Would be good if you can send a formal patch for the adaptive queuing so
people can have a look.

One thing that was left on the table is whether this should be a full-blown
scheduler or core block infrastructure that would either be
set on demand or by default.

I think that Jens and Ming expressed that this is something we should
place in the block core, but I'd like to hear more; maybe
Ming can elaborate on his ideas of how to do this.

2021-01-11 18:19:13

by Rachit Agarwal

Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

[Resending the last message] Happy 2021 everyone!

> Dear all:
>
> Hope you are all well.
>
> Sagi and I were wondering if you have any additional feedback on the
> updated patch? (@Ming?) We have been receiving a lot of
> interest/questions from industry on incorporation of i10 in the
> kernel. If people do not have additional feedback, it would be nice to
> move this forward.
>
> Looking forward to hearing from you!
> ~Rachit
>
> On Sat, Nov 28, 2020 at 12:49 AM Rachit Agarwal <[email protected]> wrote:
> >
> >
> >
> > On Fri, Nov 13, 2020 at 4:56 PM Sagi Grimberg <[email protected]> wrote:
> >>
> >>
> >> >>>> But if you think this has a better home, I'm assuming that the guys
> >> >>>> will be open to that.
> >> >>>
> >> >>> Also see the reply from Ming. It's a balancing act - don't want to add
> >> >>> extra overhead to the core, but also don't want to carry an extra
> >> >>> scheduler if the main change is really just variable dispatch batching.
> >> >>> And since we already have a notion of that, seems worthwhile to explore
> >> >>> that venue.
> >> >>
> >> >> I agree,
> >> >>
> >> >> The main difference is that this balancing is not driven from device
> >> >> resource pressure, but rather from an assumption of device specific
> >> >> optimization (and also with a specific optimization target), hence a
> >> >> scheduler a user would need to opt-in seemed like a good compromise.
> >> >>
> >> >> But maybe Ming has some good ideas on a different way to add it..
> >> >
> >> > So here's another case - virtualized nvme. The commit overhead is
> >> > suitably large there that performance suffers quite a bit, similarly to
> >> > your remote storage case. If we had suitable logic in the core, then we
> >> > could easily propagate this knowledge when setting up the queue. Then it
> >> > could happen automatically, without needing a configuration to switch to
> >> > a specific scheduler.
> >>
> >> Yes, these use-cases share characteristics. I'm not at all opposed to
> >> placing this in the core. I do think that in order to put something like
> >> this in the core, the bar needs to be higher such that an optimization
> >> target cannot be biased towards a workload (i.e. needs to be adaptive).
> >>
> >> I'm still not sure how we would build this on top of what we already
> >> have as it is really centered around device being busy (which is not
> >> the case for nvme), but I didn't put enough thought into it yet.
> >>
> >
> > Dear all:
> >
> > Thanks, again, for the very constructive discussions.
> >
> > I am writing back with quite a few updates:
> >
> > 1. We have now included a detailed comparison of i10 scheduler with Kyber with NVMe-over-TCP (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf). In a nutshell, when operating with NVMe-over-TCP, i10 demonstrates the core tradeoff: higher latency, but also higher throughput. This seems to be the core tradeoff exposed by i10.
> >
> > 2. We have now implemented an adaptive version of i10 I/O scheduler, that uses the number of outstanding requests at the time of batch dispatch (and whether the dispatch was triggered by timeout or not) to adaptively set the batching size. The new results (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf) show that i10-adaptive further improves performance for low loads, while keeping the performance for high loads. IMO, there is much to do on designing improved adaptation algorithms.
> >
> > 3. We have now updated the i10-evaluation document to include results for local storage access. The core take-away here is that i10-adaptive can achieve similar throughput and latency at low loads and at high loads when compared to noop, but still requires more work for lower loads. However, given that the tradeoff exposed by i10 scheduler is particularly useful for remote storage devices (and as Jens suggested, perhaps for virtualized local storage access), I do agree with Sagi -- I think we should consider including it in the core, since this may be useful for a broad range of new use cases.
> >
> > We have also created a second version of the patch that includes these updates: https://github.com/i10-kernel/upstream-linux/blob/master/0002-iosched-Add-i10-I-O-Scheduler.patch
> >
> > As always, thank you for the constructive discussion and I look forward to working with you on this.
> >
> > Best,
> > ~Rachit
> >