2022-01-10 09:12:38

by Wang Jianchao

Subject: [PATCH 0/13] blk: make blk-rq-qos policies pluggable and modular

Hi Jens

blk-rq-qos is a standalone framework, separate from the io-schedulers, that
can be used to control or observe IO progress in the block layer through
hooks. blk-rq-qos is a great design, but right now it is completely fixed
and built-in, which shuts out people who want to use it from external
modules.

This patchset attempts to make the blk-rq-qos framework pluggable and
modular. We can then update a blk-rq-qos policy module without stopping the
IO workload, and it becomes more convenient to introduce a new policy on old
machines without upgrading the kernel. We can also disable all of the
blk-rq-qos policies if we don't need any of them; when the
request_queue.rqos list is empty, no cpu cycles are wasted on it.

In addition, this patchset introduces a new, simple policy that observes
per-cgroup IO statistics. A new interface, 'blkio.iostat', is added to the
blkio cgroup directories. A very simple tool at the following link

https://github.com/jianchwa/iostat-cgrp.git

can be used to render the result in a friendlier fashion, for example:
Device  DATA  IOPS      BW          RQSZ     QLAT      DLAT      Cgroup
vda     R     16.00/s   572.00KB/s  35.75K   9.46us    250.68us  test
vdb     W     249.00/s  50.34MB/s   207.02K  254.33us  137.41ms
Device  META  IOPS      BW          RQSZ     QLAT      DLAT      Cgroup
vdb     W     44.00/s   792.00KB/s  18.00K   191.20us  225.25ms
Device  DATA  IOPS      BW          RQSZ     QLAT      DLAT      Cgroup
vda     R     33.00/s   412.00KB/s  12.48K   8.49us    180.84us  test
vdb     W     65.00/s   12.71MB/s   200.31K  432.02us  335.31ms
vdb     W     38.00/s   12.66MB/s   341.26K  135.56us  230.27ms  test
Device  META  IOPS      BW          RQSZ     QLAT      DLAT      Cgroup
vda     R     5.00/s    68.00KB/s   13.60K   12.51us   162.52us  test
vdb     W     119.00/s  2.28MB/s    19.63K   10.40ms   149.88ms
Device  DATA  IOPS      BW          RQSZ     QLAT      DLAT      Cgroup
vda     R     20.00/s   232.00KB/s  11.60K   8.71us    514.30us  test
vdb     W     183.00/s  35.02MB/s   195.96K  196.82us  129.58ms
vdb     W     1.00/s    380.00KB/s  380.00K  48.51us   552.68ms  test

As you can see, the output includes the device name, data vs. meta, read vs.
write, the cgroup name, etc. A line without a cgroup name belongs to the root
cgroup which, unlike non-root cgroups, does not include its children's IO.
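
As a quick sanity check on the columns, RQSZ is simply BW divided by IOPS:
for the first vda line above, 572.00KB/s / 16.00/s = 35.75K per request.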

The 1st patch introduces the general interfaces that make blk-rq-qos
pluggable and modular: register/unregister, activate/deactivate, and the
queue sysfs interface.

The 2nd patch makes blk-wbt pluggable.

The 3rd and 4th patches export some interfaces needed by the following
patches to make iolatency, iocost and ioprio modular.

The 5th patch makes blk-iolatency pluggable and modular. As it has a cgroup
policy, we can rmmod it to release a blkcg policy slot.

The 6th patch removes an unused macro.

The 7th patch introduces a new macro to control bio.bi_iocost_cost; this is
also a preparation for making iocost modular.

The 8th patch makes iocost pluggable and modular.

The 9th patch renames ioprio.c to ioprio-common.c, as it would otherwise
clash with the following ioprio.ko target in the Makefile.

The 10th patch makes the ioprio policy pluggable and modular.

The 11th patch removes some now-unused interfaces of blk-rq-qos.c, such as
rq_qos_add/del.

The 12th patch makes requests carry a blkcg_gq; this is needed by the
following iostat policy.

The 13th patch introduces the iostat policy of blk-rq-qos.

Wang Jianchao (13):
blk: make blk-rq-qos support pluggable and modular policy
blk-wbt: make wbt pluggable
blk: export following interfaces
cgroup: export following two interfaces
blk-iolatency: make iolatency pluggable and modular
blk: remove unused BLK_RQ_IO_DATA_LEN
blk: use standalone macro to control bio.bi_iocost_cost
blk-iocost: make iocost pluggable and modular
blk: rename ioprio.c to ioprio-common.c
blk-ioprio: make ioprio pluggable and modular
blk: remove unused interfaces of blk-rq-qos
blk: make request able to carry blkcg_gq
blk: introduce iostat per cgroup module

block/Kconfig | 23 ++++-
block/Makefile | 13 ++-
block/bdev.c | 5 -
block/bio.c | 2 +-
block/blk-cgroup.c | 23 +++--
block/blk-core.c | 6 +-
block/blk-iocost.c | 53 ++++++----
block/blk-iolatency.c | 39 +++++--
block/blk-ioprio.c | 50 ++++++---
block/blk-ioprio.h | 19 ----
block/blk-iostat.c | 347 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
block/blk-merge.c | 9 ++
block/blk-mq-debugfs.c | 22 +---
block/blk-mq.c | 14 +++
block/blk-rq-qos.c | 4 +-
block/blk-rq-qos.h | 67 +-----------
block/blk-stat.c | 30 ------
block/blk-stat.h | 31 +++++-
block/blk-sysfs.c | 7 +-
block/blk-wbt.c | 30 +++++-
block/blk-wbt.h | 8 +-
block/blk.h | 6 --
block/{ioprio.c => ioprio-common.c} | 0
include/linux/blk-cgroup.h | 1 +
include/linux/blk-mq.h | 4 +-
include/linux/blk_types.h | 2 +-
include/linux/blkdev.h | 7 +-
include/linux/cgroup.h | 5 +-
kernel/cgroup/cgroup.c | 7 ++
29 files changed, 599 insertions(+), 235 deletions(-)



2022-01-10 09:12:42

by Wang Jianchao

Subject: [PATCH 01/13] blk: make blk-rq-qos support pluggable and modular policy

From: Wang Jianchao <[email protected]>

blk-rq-qos is a standalone framework, separate from the io-schedulers,
that can be used to control or observe IO progress in the block layer
through hooks. blk-rq-qos is a great design, but right now it is
completely fixed and built-in, which shuts out people who want to use
it from external modules.

This patch makes blk-rq-qos policies pluggable and modular:
(1) Add code to maintain the rq_qos_ops. A rq-qos module needs to
    register itself with rq_qos_register(). The original enum
    rq_qos_id will be removed in a following patch; policies will
    use a dynamic id maintained by rq_qos_ida instead.
(2) Add an .init callback to rq_qos_ops. We use it to initialize the
    policy's resources.
(3) Add /sys/block/<dev>/queue/qos. Writing '+name' or '-name' to it
    enables or disables the corresponding blk-rq-qos policy; see the
    example session below.
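
A hypothetical session (device and policy names are illustrative):

    # cat /sys/block/vda/queue/qos
    [wbt] iocost
    # echo "+iocost" > /sys/block/vda/queue/qos
    # echo "-wbt" > /sys/block/vda/queue/qos
    # cat /sys/block/vda/queue/qos
    [iocost] wbt

Active policies are shown in brackets in the order they are invoked;
the rest are registered but inactive.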

Because the rq-qos list can be modified at any time, rq_qos_id(),
which has been renamed to rq_qos_by_id(), has to iterate the list
under sysfs_lock or queue_lock; this patch adapts the code
accordingly. For more details, please refer to the comment above
rq_qos_get(). rq_qos_exit() is also moved to blk_cleanup_queue().
Apart from these modifications there is no other functional change
here. The following patches adapt wbt, iolatency, iocost and ioprio
to make them pluggable and modular one by one.
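
To illustrate the resulting API, here is a minimal sketch of a
hypothetical out-of-tree policy ("qos-noop" and all of its symbols are
invented for illustration). It does nothing useful; it only shows the
register/activate plumbing introduced by this patch:

#include <linux/module.h>
#include <linux/slab.h>
#include "blk-rq-qos.h"

struct noop_qos {
	struct rq_qos rqos;
	/* policy-private state would live here */
};

static int noop_init(struct request_queue *q);

static void noop_exit(struct rq_qos *rqos)
{
	struct noop_qos *nq = container_of(rqos, struct noop_qos, rqos);

	/* unlink from q->rq_qos, then release our state */
	rq_qos_deactivate(rqos);
	kfree(nq);
}

static struct rq_qos_ops noop_ops = {
	.owner	= THIS_MODULE,
	.name	= "qos-noop",
	.init	= noop_init,
	.exit	= noop_exit,
};

/* called by rq_qos_switch() when "+qos-noop" is written to the qos file */
static int noop_init(struct request_queue *q)
{
	struct noop_qos *nq;

	nq = kzalloc_node(sizeof(*nq), GFP_KERNEL, q->node);
	if (!nq)
		return -ENOMEM;

	/* links the rqos into q->rq_qos under sysfs_lock/queue_lock */
	rq_qos_activate(q, &nq->rqos, &noop_ops);
	return 0;
}

static int __init noop_mod_init(void)
{
	return rq_qos_register(&noop_ops);
}

static void __exit noop_mod_exit(void)
{
	rq_qos_unregister(&noop_ops);
}

module_init(noop_mod_init);
module_exit(noop_mod_exit);
MODULE_LICENSE("GPL");

Once registered, such a policy shows up in the qos file and can be
toggled like the built-in ones. (Note that blk-rq-qos.h is a
block-internal header, so a real external module would additionally
need it exported.)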

Signed-off-by: Wang Jianchao <[email protected]>
---
block/blk-core.c | 2 +
block/blk-iocost.c | 20 ++-
block/blk-mq-debugfs.c | 4 +-
block/blk-rq-qos.c | 312 ++++++++++++++++++++++++++++++++++++++++-
block/blk-rq-qos.h | 55 +++++++-
block/blk-sysfs.c | 2 +
block/blk-wbt.c | 6 +-
block/elevator.c | 3 +
block/genhd.c | 3 -
include/linux/blkdev.h | 4 +
10 files changed, 394 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1378d084c770..2847ab514c1f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -51,6 +51,7 @@
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-throttle.h"
+#include "blk-rq-qos.h"

struct dentry *blk_debugfs_root;

@@ -377,6 +378,7 @@ void blk_cleanup_queue(struct request_queue *q)
* it is safe to free requests now.
*/
mutex_lock(&q->sysfs_lock);
+ rq_qos_exit(q);
if (q->elevator)
blk_mq_sched_free_rqs(q);
mutex_unlock(&q->sysfs_lock);
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 769b64394298..cfc0e305c32e 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -662,7 +662,7 @@ static struct ioc *rqos_to_ioc(struct rq_qos *rqos)

static struct ioc *q_to_ioc(struct request_queue *q)
{
- return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST));
+ return rqos_to_ioc(rq_qos_by_id(q, RQ_QOS_COST));
}

static const char *q_name(struct request_queue *q)
@@ -3162,6 +3162,7 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
size_t nbytes, loff_t off)
{
struct block_device *bdev;
+ struct rq_qos *rqos;
struct ioc *ioc;
u32 qos[NR_QOS_PARAMS];
bool enable, user;
@@ -3172,14 +3173,15 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
if (IS_ERR(bdev))
return PTR_ERR(bdev);

- ioc = q_to_ioc(bdev_get_queue(bdev));
- if (!ioc) {
+ rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ if (!rqos) {
ret = blk_iocost_init(bdev_get_queue(bdev));
if (ret)
goto err;
- ioc = q_to_ioc(bdev_get_queue(bdev));
+ rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
}

+ ioc = rqos_to_ioc(rqos);
spin_lock_irq(&ioc->lock);
memcpy(qos, ioc->params.qos, sizeof(qos));
enable = ioc->enabled;
@@ -3272,10 +3274,12 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
ioc_refresh_params(ioc, true);
spin_unlock_irq(&ioc->lock);

+ rq_qos_put(rqos);
blkdev_put_no_open(bdev);
return nbytes;
einval:
ret = -EINVAL;
+ rq_qos_put(rqos);
err:
blkdev_put_no_open(bdev);
return ret;
@@ -3329,6 +3333,7 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
size_t nbytes, loff_t off)
{
struct block_device *bdev;
+ struct rq_qos *rqos;
struct ioc *ioc;
u64 u[NR_I_LCOEFS];
bool user;
@@ -3339,14 +3344,15 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
if (IS_ERR(bdev))
return PTR_ERR(bdev);

- ioc = q_to_ioc(bdev_get_queue(bdev));
- if (!ioc) {
+ rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ if (!rqos) {
ret = blk_iocost_init(bdev_get_queue(bdev));
if (ret)
goto err;
- ioc = q_to_ioc(bdev_get_queue(bdev));
+ rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
}

+ ioc = rqos_to_ioc(rqos);
spin_lock_irq(&ioc->lock);
memcpy(u, ioc->params.i_lcoefs, sizeof(u));
user = ioc->user_cost_model;
@@ -3397,11 +3403,13 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
ioc_refresh_params(ioc, true);
spin_unlock_irq(&ioc->lock);

+ rq_qos_put(rqos);
blkdev_put_no_open(bdev);
return nbytes;

einval:
ret = -EINVAL;
+ rq_qos_put(rqos);
err:
blkdev_put_no_open(bdev);
return ret;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4f2cf8399f3d..e3e8d54c836f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -841,7 +841,9 @@ void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
{
struct request_queue *q = rqos->q;
- const char *dir_name = rq_qos_id_to_name(rqos->id);
+ const char *dir_name;
+
+ dir_name = rqos->ops->name ? rqos->ops->name : rq_qos_id_to_name(rqos->id);

if (rqos->debugfs_dir || !rqos->ops->debugfs_attrs)
return;
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index e83af7bc7591..a94ff872722b 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -2,6 +2,11 @@

#include "blk-rq-qos.h"

+static DEFINE_IDA(rq_qos_ida);
+static int nr_rqos_blkcg_pols;
+static DEFINE_MUTEX(rq_qos_mutex);
+static LIST_HEAD(rq_qos_list);
+
/*
* Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
* false if 'v' + 1 would be bigger than 'below'.
@@ -294,11 +299,316 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,

void rq_qos_exit(struct request_queue *q)
{
- blk_mq_debugfs_unregister_queue_rqos(q);
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock));

while (q->rq_qos) {
struct rq_qos *rqos = q->rq_qos;
q->rq_qos = rqos->next;
+ if (rqos->ops->owner)
+ module_put(rqos->ops->owner);
rqos->ops->exit(rqos);
}
+ blk_mq_debugfs_unregister_queue_rqos(q);
+}
+
+/*
+ * With pluggable blk-rq-qos, the rqos life cycle becomes more
+ * complicated: the qos switching path can add/delete an rqos to/from
+ * the request_queue under sysfs_lock and queue_lock. The following
+ * places may access an rqos through rq_qos_by_id() concurrently:
+ * (1) normal IO path, under q_usage_counter,
+ * (2) queue sysfs interfaces, under sysfs_lock,
+ * (3) blkg_create, the .pd_init_fn() may access rqos, under queue_lock,
+ * (4) cgroup file, such as ioc_cost_model_write,
+ *
+ * (1)(2)(3) are definitely safe. Case (4) is tricky; rq_qos_get()
+ * exists for that case.
+ */
+struct rq_qos *rq_qos_get(struct request_queue *q, int id)
+{
+ struct rq_qos *rqos;
+
+ spin_lock_irq(&q->queue_lock);
+ rqos = rq_qos_by_id(q, id);
+ if (rqos && rqos->dying)
+ rqos = NULL;
+ if (rqos)
+ refcount_inc(&rqos->ref);
+ spin_unlock_irq(&q->queue_lock);
+ return rqos;
+}
+EXPORT_SYMBOL_GPL(rq_qos_get);
+
+void rq_qos_put(struct rq_qos *rqos)
+{
+ struct request_queue *q = rqos->q;
+
+ spin_lock_irq(&q->queue_lock);
+ refcount_dec(&rqos->ref);
+ if (rqos->dying)
+ wake_up(&rqos->waitq);
+ spin_unlock_irq(&q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(rq_qos_put);
+
+void rq_qos_activate(struct request_queue *q,
+ struct rq_qos *rqos, const struct rq_qos_ops *ops)
+{
+ struct rq_qos *pos;
+ bool rq_alloc_time = false;
+
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock));
+
+ rqos->dying = false;
+ refcount_set(&rqos->ref, 1);
+ init_waitqueue_head(&rqos->waitq);
+ rqos->id = ops->id;
+ rqos->ops = ops;
+ rqos->q = q;
+ rqos->next = NULL;
+
+ spin_lock_irq(&q->queue_lock);
+ pos = q->rq_qos;
+ if (pos) {
+ while (pos->next) {
+ if (pos->ops->flags & RQOS_FLAG_RQ_ALLOC_TIME)
+ rq_alloc_time = true;
+ pos = pos->next;
+ }
+ pos->next = rqos;
+ } else {
+ q->rq_qos = rqos;
+ }
+ if (ops->flags & RQOS_FLAG_RQ_ALLOC_TIME &&
+ !rq_alloc_time)
+ blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, q);
+
+ spin_unlock_irq(&q->queue_lock);
+
+ if (rqos->ops->debugfs_attrs)
+ blk_mq_debugfs_register_rqos(rqos);
+}
+EXPORT_SYMBOL_GPL(rq_qos_activate);
+
+void rq_qos_deactivate(struct rq_qos *rqos)
+{
+ struct request_queue *q = rqos->q;
+ struct rq_qos **cur, *pos;
+ bool rq_alloc_time = false;
+
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock));
+
+ spin_lock_irq(&q->queue_lock);
+ rqos->dying = true;
+ /*
+ * Drain all of the usage of get/put_rqos()
+ */
+ wait_event_lock_irq(rqos->waitq,
+ refcount_read(&rqos->ref) == 1, q->queue_lock);
+ for (cur = &q->rq_qos; *cur; cur = &(*cur)->next) {
+ if (*cur == rqos) {
+ *cur = rqos->next;
+ break;
+ }
+ }
+
+ pos = q->rq_qos;
+ while (pos && pos->next) {
+ if (pos->ops->flags & RQOS_FLAG_RQ_ALLOC_TIME)
+ rq_alloc_time = true;
+ pos = pos->next;
+ }
+
+ if (rqos->ops->flags & RQOS_FLAG_RQ_ALLOC_TIME &&
+ !rq_alloc_time)
+ blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, q);
+
+ spin_unlock_irq(&q->queue_lock);
+ blk_mq_debugfs_unregister_rqos(rqos);
+}
+EXPORT_SYMBOL_GPL(rq_qos_deactivate);
+
+static struct rq_qos_ops *rq_qos_find_by_name(const char *name)
+{
+ struct rq_qos_ops *pos;
+
+ list_for_each_entry(pos, &rq_qos_list, node) {
+ if (!strcmp(pos->name, name))
+ return pos;
+ }
+
+ return NULL;
+}
+
+int rq_qos_register(struct rq_qos_ops *ops)
+{
+ int ret, start;
+
+ mutex_lock(&rq_qos_mutex);
+
+ if (rq_qos_find_by_name(ops->name)) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ if (ops->flags & RQOS_FLAG_CGRP_POL &&
+ nr_rqos_blkcg_pols >= (BLKCG_MAX_POLS - BLKCG_NON_RQOS_POLS)) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ start = RQ_QOS_IOPRIO + 1;
+ ret = ida_simple_get(&rq_qos_ida, start, INT_MAX, GFP_KERNEL);
+ if (ret < 0)
+ goto out;
+
+ if (ops->flags & RQOS_FLAG_CGRP_POL)
+ nr_rqos_blkcg_pols++;
+
+ ops->id = ret;
+ ret = 0;
+ INIT_LIST_HEAD(&ops->node);
+ list_add_tail(&ops->node, &rq_qos_list);
+out:
+ mutex_unlock(&rq_qos_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(rq_qos_register);
+
+void rq_qos_unregister(struct rq_qos_ops *ops)
+{
+ mutex_lock(&rq_qos_mutex);
+
+ if (ops->flags & RQOS_FLAG_CGRP_POL)
+ nr_rqos_blkcg_pols--;
+ list_del_init(&ops->node);
+ ida_simple_remove(&rq_qos_ida, ops->id);
+ mutex_unlock(&rq_qos_mutex);
+}
+EXPORT_SYMBOL_GPL(rq_qos_unregister);
+
+ssize_t queue_qos_show(struct request_queue *q, char *buf)
+{
+ struct rq_qos_ops *ops;
+ struct rq_qos *rqos;
+ int ret = 0;
+
+ mutex_lock(&rq_qos_mutex);
+ /*
+ * Show the policies in the order of being invoked
+ */
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (!rqos->ops->name)
+ continue;
+ ret += sprintf(buf + ret, "[%s] ", rqos->ops->name);
+ }
+ list_for_each_entry(ops, &rq_qos_list, node) {
+ if (!rq_qos_by_name(q, ops->name))
+ ret += sprintf(buf + ret, "%s ", ops->name);
+ }
+
+ if (ret > 0)
+ ret--; /* overwrite the trailing space */
+ ret += sprintf(buf + ret, "\n");
+ mutex_unlock(&rq_qos_mutex);
+
+ return ret;
+}
+
+int rq_qos_switch(struct request_queue *q,
+ const struct rq_qos_ops *ops,
+ struct rq_qos *rqos)
+{
+ int ret;
+
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock));
+
+ blk_mq_freeze_queue(q);
+ if (!rqos) {
+ ret = ops->init(q);
+ } else {
+ ops->exit(rqos);
+ ret = 0;
+ }
+ blk_mq_unfreeze_queue(q);
+
+ return ret;
+}
+
+ssize_t queue_qos_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ const struct rq_qos_ops *ops;
+ struct rq_qos *rqos;
+ const char *qosname;
+ char *buf, *p;
+ bool add;
+ int ret;
+
+ buf = kstrdup(page, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ p = strim(buf); /* keep buf intact for the final kfree() */
+ if (p[0] != '+' && p[0] != '-') {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ add = p[0] == '+';
+ qosname = p + 1;
+
+ rqos = rq_qos_by_name(q, qosname);
+ if (add && rqos) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ if (!add && !rqos) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ mutex_lock(&rq_qos_mutex);
+ if (add) {
+ ops = rq_qos_find_by_name(qosname);
+ if (!ops) {
+ /*
+ * The module_init() callback may acquire this mutex
+ */
+ mutex_unlock(&rq_qos_mutex);
+ request_module("%s", qosname);
+ mutex_lock(&rq_qos_mutex);
+ ops = rq_qos_find_by_name(qosname);
+ }
+ } else {
+ ops = rqos->ops;
+ }
+
+ if (!ops) {
+ ret = -EINVAL;
+ } else if (ops->owner && !try_module_get(ops->owner)) {
+ ops = NULL;
+ ret = -EAGAIN;
+ }
+ mutex_unlock(&rq_qos_mutex);
+
+ if (!ops)
+ goto out;
+
+ if (add) {
+ ret = rq_qos_switch(q, ops, NULL);
+ if (!ret && ops->owner)
+ __module_get(ops->owner);
+ } else {
+ rq_qos_switch(q, ops, rqos);
+ ret = 0;
+ if (ops->owner)
+ module_put(ops->owner);
+ }
+
+ if (ops->owner)
+ module_put(ops->owner);
+out:
+ kfree(buf);
+ return ret ? ret : count;
}
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 3cfbc8668cba..c2b9b41f8fd4 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -26,7 +26,10 @@ struct rq_wait {
};

struct rq_qos {
- struct rq_qos_ops *ops;
+ refcount_t ref;
+ wait_queue_head_t waitq;
+ bool dying;
+ const struct rq_qos_ops *ops;
struct request_queue *q;
enum rq_qos_id id;
struct rq_qos *next;
@@ -35,7 +38,17 @@ struct rq_qos {
#endif
};

+enum {
+ RQOS_FLAG_CGRP_POL = 1 << 0,
+ RQOS_FLAG_RQ_ALLOC_TIME = 1 << 1
+};
+
struct rq_qos_ops {
+ struct list_head node;
+ struct module *owner;
+ const char *name;
+ int flags;
+ int id;
void (*throttle)(struct rq_qos *, struct bio *);
void (*track)(struct rq_qos *, struct request *, struct bio *);
void (*merge)(struct rq_qos *, struct request *, struct bio *);
@@ -46,6 +59,7 @@ struct rq_qos_ops {
void (*cleanup)(struct rq_qos *, struct bio *);
void (*queue_depth_changed)(struct rq_qos *);
void (*exit)(struct rq_qos *);
+ int (*init)(struct request_queue *);
const struct blk_mq_debugfs_attr *debugfs_attrs;
};

@@ -59,10 +73,12 @@ struct rq_depth {
unsigned int default_depth;
};

-static inline struct rq_qos *rq_qos_id(struct request_queue *q,
- enum rq_qos_id id)
+static inline struct rq_qos *rq_qos_by_id(struct request_queue *q, int id)
{
struct rq_qos *rqos;
+
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock) && !spin_is_locked(&q->queue_lock));
+
for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
if (rqos->id == id)
break;
@@ -72,12 +88,12 @@ static inline struct rq_qos *rq_qos_id(struct request_queue *q,

static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
{
- return rq_qos_id(q, RQ_QOS_WBT);
+ return rq_qos_by_id(q, RQ_QOS_WBT);
}

static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
{
- return rq_qos_id(q, RQ_QOS_LATENCY);
+ return rq_qos_by_id(q, RQ_QOS_LATENCY);
}

static inline void rq_wait_init(struct rq_wait *rq_wait)
@@ -132,6 +148,35 @@ static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
blk_mq_debugfs_unregister_rqos(rqos);
}

+int rq_qos_register(struct rq_qos_ops *ops);
+void rq_qos_unregister(struct rq_qos_ops *ops);
+void rq_qos_activate(struct request_queue *q,
+ struct rq_qos *rqos, const struct rq_qos_ops *ops);
+void rq_qos_deactivate(struct rq_qos *rqos);
+ssize_t queue_qos_show(struct request_queue *q, char *buf);
+ssize_t queue_qos_store(struct request_queue *q, const char *page,
+ size_t count);
+struct rq_qos *rq_qos_get(struct request_queue *q, int id);
+void rq_qos_put(struct rq_qos *rqos);
+
+static inline struct rq_qos *rq_qos_by_name(struct request_queue *q,
+ const char *name)
+{
+ struct rq_qos *rqos;
+
+ WARN_ON(!mutex_is_locked(&q->sysfs_lock));
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (!rqos->ops->name)
+ continue;
+
+ if (!strcmp(rqos->ops->name, name))
+ return rqos;
+ }
+ return NULL;
+}
+
typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *private_data);
typedef void (cleanup_cb_t)(struct rq_wait *rqw, void *private_data);

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cd75b0f73dc6..91f980985b1b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -573,6 +573,7 @@ QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
QUEUE_RW_ENTRY(elv_iosched, "scheduler");
+QUEUE_RW_ENTRY(queue_qos, "qos");

QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
@@ -632,6 +633,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
&elv_iosched_entry.attr,
+ &queue_qos_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 0c119be0e813..88265ae4fa41 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -628,9 +628,13 @@ static void wbt_requeue(struct rq_qos *rqos, struct request *rq)

void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
{
- struct rq_qos *rqos = wbt_rq_qos(q);
+ struct rq_qos *rqos;
+
+ spin_lock_irq(&q->queue_lock);
+ rqos = wbt_rq_qos(q);
if (rqos)
RQWB(rqos)->wc = write_cache_on;
+ spin_unlock_irq(&q->queue_lock);
}

/*
diff --git a/block/elevator.c b/block/elevator.c
index 19a78d5516ba..fe664674c14d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -701,12 +701,15 @@ void elevator_init_mq(struct request_queue *q)
* requests, then no need to quiesce queue which may add long boot
* latency, especially when lots of disks are involved.
*/
+
+ mutex_lock(&q->sysfs_lock);
blk_mq_freeze_queue(q);
blk_mq_cancel_work_sync(q);

err = blk_mq_init_sched(q, e);

blk_mq_unfreeze_queue(q);
+ mutex_unlock(&q->sysfs_lock);

if (err) {
pr_warn("\"%s\" elevator initialization failed, "
diff --git a/block/genhd.c b/block/genhd.c
index 30362aeacac4..af2e8ebce46e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -27,7 +27,6 @@
#include <linux/badblocks.h>

#include "blk.h"
-#include "blk-rq-qos.h"

static struct kobject *block_depr;

@@ -621,8 +620,6 @@ void del_gendisk(struct gendisk *disk)
device_del(disk_to_dev(disk));

blk_mq_freeze_queue_wait(q);
-
- rq_qos_exit(q);
blk_sync_queue(q);
blk_flush_integrity();
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bd4370baccca..e7dce2232814 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -43,6 +43,10 @@ struct blk_crypto_profile;
* Defined here to simplify include dependency.
*/
#define BLKCG_MAX_POLS 6
+/*
+ * Non blk-rq-qos blkcg policies include blk-throttle and bfq
+ */
+#define BLKCG_NON_RQOS_POLS 2

static inline int blk_validate_block_size(unsigned int bsize)
{
--
2.17.1


2022-01-10 09:13:16

by Wang Jianchao

Subject: [PATCH 08/13] blk-iocost: make iocost pluggable and modular

From: Wang Jianchao <[email protected]>

Make blk-iocost pluggable and modular. We can then enable or disable
it through /sys/block/<dev>/queue/qos and rmmod the module when we
don't need it, which releases one blkcg policy slot.
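
A hypothetical session (device name illustrative; writing '+iocost'
loads the module on demand via request_module()):

    # echo "+iocost" > /sys/block/vda/queue/qos
    # echo "-iocost" > /sys/block/vda/queue/qos
    # rmmod iocost        # frees the blkcg policy slot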

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 2 +-
block/Makefile | 4 ++--
block/blk-iocost.c | 53 ++++++++++++++++++++++++++----------------
block/blk-mq-debugfs.c | 2 --
block/blk-rq-qos.h | 1 -
5 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index e1b1bff5c1e9..3e1a3487b55a 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -134,7 +134,7 @@ config BLK_CGROUP_FC_APPID
application specific identification into the FC frame.

config BLK_CGROUP_IOCOST
- bool "Enable support for cost model based cgroup IO controller"
+ tristate "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP
select BLK_RQ_ALLOC_TIME
select BLK_BIO_IOCOST
diff --git a/block/Makefile b/block/Makefile
index ccf61c57e1d4..8950913cbcc9 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,8 +20,8 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOPRIO) += blk-ioprio.o
iolat-y := blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += iolat.o
-
-obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
+iocost-y := blk-iocost.o
+obj-$(CONFIG_BLK_CGROUP_IOCOST) += iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index cfc0e305c32e..afa52354d42b 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -660,9 +660,10 @@ static struct ioc *rqos_to_ioc(struct rq_qos *rqos)
return container_of(rqos, struct ioc, rqos);
}

+static struct rq_qos_ops ioc_rqos_ops;
static struct ioc *q_to_ioc(struct request_queue *q)
{
- return rqos_to_ioc(rq_qos_by_id(q, RQ_QOS_COST));
+ return rqos_to_ioc(rq_qos_by_id(q, ioc_rqos_ops.id));
}

static const char *q_name(struct request_queue *q)
@@ -2810,6 +2811,7 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
struct ioc *ioc = rqos_to_ioc(rqos);

blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
+ rq_qos_deactivate(rqos);

spin_lock_irq(&ioc->lock);
ioc->running = IOC_STOP;
@@ -2820,13 +2822,20 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
kfree(ioc);
}

+static int blk_iocost_init(struct request_queue *q);
static struct rq_qos_ops ioc_rqos_ops = {
+#if IS_MODULE(CONFIG_BLK_CGROUP_IOCOST)
+ .owner = THIS_MODULE,
+#endif
+ .name = "iocost",
+ .flags = RQOS_FLAG_CGRP_POL | RQOS_FLAG_RQ_ALLOC_TIME,
.throttle = ioc_rqos_throttle,
.merge = ioc_rqos_merge,
.done_bio = ioc_rqos_done_bio,
.done = ioc_rqos_done,
.queue_depth_changed = ioc_rqos_queue_depth_changed,
.exit = ioc_rqos_exit,
+ .init = blk_iocost_init,
};

static int blk_iocost_init(struct request_queue *q)
@@ -2856,10 +2865,7 @@ static int blk_iocost_init(struct request_queue *q)
}

rqos = &ioc->rqos;
- rqos->id = RQ_QOS_COST;
- rqos->ops = &ioc_rqos_ops;
- rqos->q = q;
-
+ rq_qos_activate(q, rqos, &ioc_rqos_ops);
spin_lock_init(&ioc->lock);
timer_setup(&ioc->timer, ioc_timer_fn, 0);
INIT_LIST_HEAD(&ioc->active_iocgs);
@@ -2883,10 +2889,9 @@ static int blk_iocost_init(struct request_queue *q)
* called before policy activation completion, can't assume that the
* target bio has an iocg associated and need to test for NULL iocg.
*/
- rq_qos_add(q, rqos);
ret = blkcg_activate_policy(q, &blkcg_policy_iocost);
if (ret) {
- rq_qos_del(q, rqos);
+ rq_qos_deactivate(rqos);
free_percpu(ioc->pcpu_stat);
kfree(ioc);
return ret;
@@ -3173,12 +3178,10 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
if (IS_ERR(bdev))
return PTR_ERR(bdev);

- rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ rqos = rq_qos_get(bdev_get_queue(bdev), ioc_rqos_ops.id);
if (!rqos) {
- ret = blk_iocost_init(bdev_get_queue(bdev));
- if (ret)
- goto err;
- rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ ret = -EOPNOTSUPP;
+ goto err;
}

ioc = rqos_to_ioc(rqos);
@@ -3257,10 +3260,8 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,

if (enable) {
blk_stat_enable_accounting(ioc->rqos.q);
- blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q);
ioc->enabled = true;
} else {
- blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q);
ioc->enabled = false;
}

@@ -3344,12 +3345,10 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
if (IS_ERR(bdev))
return PTR_ERR(bdev);

- rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ rqos = rq_qos_get(bdev_get_queue(bdev), ioc_rqos_ops.id);
if (!rqos) {
- ret = blk_iocost_init(bdev_get_queue(bdev));
- if (ret)
- goto err;
- rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
+ ret = -EOPNOTSUPP;
+ goto err;
}

ioc = rqos_to_ioc(rqos);
@@ -3449,13 +3448,27 @@ static struct blkcg_policy blkcg_policy_iocost = {

static int __init ioc_init(void)
{
- return blkcg_policy_register(&blkcg_policy_iocost);
+ int ret;
+
+ ret = rq_qos_register(&ioc_rqos_ops);
+ if (ret)
+ return ret;
+
+ ret = blkcg_policy_register(&blkcg_policy_iocost);
+ if (ret)
+ rq_qos_unregister(&ioc_rqos_ops);
+
+ return ret;
}

static void __exit ioc_exit(void)
{
blkcg_policy_unregister(&blkcg_policy_iocost);
+ rq_qos_unregister(&ioc_rqos_ops);
}

module_init(ioc_init);
module_exit(ioc_exit);
+MODULE_AUTHOR("Tejun Heo");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Cost model based cgroup IO controller");
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 57c33f4730f2..14fda9a5e552 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -820,8 +820,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
static const char *rq_qos_id_to_name(enum rq_qos_id id)
{
switch (id) {
- case RQ_QOS_COST:
- return "cost";
case RQ_QOS_IOPRIO:
return "ioprio";
}
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 6ca46c69e325..4eef53f2c290 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,7 +14,6 @@
struct blk_mq_debugfs_attr;

enum rq_qos_id {
- RQ_QOS_COST,
RQ_QOS_IOPRIO,
};

--
2.17.1


2022-01-10 09:13:26

by Wang Jianchao

Subject: [PATCH 10/13] blk-ioprio: make ioprio pluggable and modular

From: Wang Jianchao <[email protected]>

Make blk-ioprio pluggable and modular. We can then enable or disable
it through /sys/block/<dev>/queue/qos and rmmod the module when we
don't need it, which releases one blkcg policy slot.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 2 +-
block/Makefile | 3 ++-
block/blk-cgroup.c | 5 -----
block/blk-ioprio.c | 50 ++++++++++++++++++++++++++++--------------
block/blk-ioprio.h | 19 ----------------
block/blk-mq-debugfs.c | 4 ----
block/blk-rq-qos.c | 2 +-
block/blk-rq-qos.h | 2 +-
8 files changed, 38 insertions(+), 49 deletions(-)
delete mode 100644 block/blk-ioprio.h

diff --git a/block/Kconfig b/block/Kconfig
index 3e1a3487b55a..b3a2c656a64b 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -145,7 +145,7 @@ config BLK_CGROUP_IOCOST
their share of the overall weight distribution.

config BLK_CGROUP_IOPRIO
- bool "Cgroup I/O controller for assigning an I/O priority class"
+ tristate "Cgroup I/O controller for assigning an I/O priority class"
depends on BLK_CGROUP
help
Enable the .prio interface for assigning an I/O priority class to
diff --git a/block/Makefile b/block/Makefile
index beacc3a03c8b..3f76836076b2 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -17,7 +17,8 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_CGROUP_RWSTAT) += blk-cgroup-rwstat.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
-obj-$(CONFIG_BLK_CGROUP_IOPRIO) += blk-ioprio.o
+ioprio-y := blk-ioprio.o
+obj-$(CONFIG_BLK_CGROUP_IOPRIO) += ioprio.o
iolat-y := blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += iolat.o
iocost-y := blk-iocost.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fd874dfd38ed..c5dc44194314 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -31,7 +31,6 @@
#include <linux/tracehook.h>
#include <linux/psi.h>
#include "blk.h"
-#include "blk-ioprio.h"
#include "blk-throttle.h"

/*
@@ -1204,10 +1203,6 @@ int blkcg_init_queue(struct request_queue *q)
if (preloaded)
radix_tree_preload_end();

- ret = blk_ioprio_init(q);
- if (ret)
- goto err_destroy_all;
-
ret = blk_throtl_init(q);
if (ret)
goto err_destroy_all;
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index 332a07761bf8..93d8ba698942 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -17,7 +17,6 @@
#include <linux/blk_types.h>
#include <linux/kernel.h>
#include <linux/module.h>
-#include "blk-ioprio.h"
#include "blk-rq-qos.h"

/**
@@ -209,15 +208,23 @@ static void blkcg_ioprio_exit(struct rq_qos *rqos)
container_of(rqos, typeof(*blkioprio_blkg), rqos);

blkcg_deactivate_policy(rqos->q, &ioprio_policy);
+ rq_qos_deactivate(rqos);
kfree(blkioprio_blkg);
}

+static int blk_ioprio_init(struct request_queue *q);
static struct rq_qos_ops blkcg_ioprio_ops = {
+#if IS_MODULE(CONFIG_BLK_CGROUP_IOPRIO)
+ .owner = THIS_MODULE,
+#endif
+ .flags = RQOS_FLAG_CGRP_POL,
+ .name = "ioprio",
.track = blkcg_ioprio_track,
.exit = blkcg_ioprio_exit,
+ .init = blk_ioprio_init,
};

-int blk_ioprio_init(struct request_queue *q)
+static int blk_ioprio_init(struct request_queue *q)
{
struct blk_ioprio *blkioprio_blkg;
struct rq_qos *rqos;
@@ -227,36 +234,45 @@ int blk_ioprio_init(struct request_queue *q)
if (!blkioprio_blkg)
return -ENOMEM;

+ /*
+ * No need to worry about ioprio_blkcg_from_css() returning NULL,
+ * as the queue is frozen right now.
+ */
+ rqos = &blkioprio_blkg->rqos;
+ rq_qos_activate(q, rqos, &blkcg_ioprio_ops);
+
ret = blkcg_activate_policy(q, &ioprio_policy);
if (ret) {
+ rq_qos_deactivate(rqos);
kfree(blkioprio_blkg);
- return ret;
}

- rqos = &blkioprio_blkg->rqos;
- rqos->id = RQ_QOS_IOPRIO;
- rqos->ops = &blkcg_ioprio_ops;
- rqos->q = q;
-
- /*
- * Registering the rq-qos policy after activating the blk-cgroup
- * policy guarantees that ioprio_blkcg_from_bio(bio) != NULL in the
- * rq-qos callbacks.
- */
- rq_qos_add(q, rqos);
-
- return 0;
+ return ret;
}

static int __init ioprio_init(void)
{
- return blkcg_policy_register(&ioprio_policy);
+ int ret;
+
+ ret = rq_qos_register(&blkcg_ioprio_ops);
+ if (ret)
+ return ret;
+
+ ret = blkcg_policy_register(&ioprio_policy);
+ if (ret)
+ rq_qos_unregister(&blkcg_ioprio_ops);
+
+ return ret;
}

static void __exit ioprio_exit(void)
{
blkcg_policy_unregister(&ioprio_policy);
+ rq_qos_unregister(&blkcg_ioprio_ops);
}

module_init(ioprio_init);
module_exit(ioprio_exit);
+MODULE_AUTHOR("Bart Van Assche");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Cgroup I/O controller for assigning an I/O priority class");
diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
deleted file mode 100644
index a7785c2f1aea..000000000000
--- a/block/blk-ioprio.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#ifndef _BLK_IOPRIO_H_
-#define _BLK_IOPRIO_H_
-
-#include <linux/kconfig.h>
-
-struct request_queue;
-
-#ifdef CONFIG_BLK_CGROUP_IOPRIO
-int blk_ioprio_init(struct request_queue *q);
-#else
-static inline int blk_ioprio_init(struct request_queue *q)
-{
- return 0;
-}
-#endif
-
-#endif /* _BLK_IOPRIO_H_ */
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 14fda9a5e552..90610a0cd25a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -819,10 +819,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)

static const char *rq_qos_id_to_name(enum rq_qos_id id)
{
- switch (id) {
- case RQ_QOS_IOPRIO:
- return "ioprio";
- }
return "unknown";
}

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 08ccd4a4e913..15852147ba73 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -459,7 +459,7 @@ int rq_qos_register(struct rq_qos_ops *ops)
goto out;
}

- start = RQ_QOS_IOPRIO + 1;
+ start = 1;
ret = ida_simple_get(&rq_qos_ida, start, INT_MAX, GFP_KERNEL);
if (ret < 0)
goto out;
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 4eef53f2c290..ee396367a5b2 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,7 +14,7 @@
struct blk_mq_debugfs_attr;

enum rq_qos_id {
- RQ_QOS_IOPRIO,
+ RQ_QOS_UNUSED,
};

struct rq_wait {
--
2.17.1


2022-01-10 09:13:32

by Wang Jianchao

Subject: [PATCH 11/13] blk: remove unused interfaces of blk-rq-qos

From: Wang Jianchao <[email protected]>

No functional changes here: remove the now-unused rq_qos_add()/
rq_qos_del() helpers, the leftover rq_qos_id_to_name() and the empty
rq_qos_id enum.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/blk-mq-debugfs.c | 10 +-------
block/blk-rq-qos.h | 52 +-----------------------------------------
2 files changed, 2 insertions(+), 60 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 90610a0cd25a..f4f5ca1953f3 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -817,11 +817,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
q->sched_debugfs_dir = NULL;
}

-static const char *rq_qos_id_to_name(enum rq_qos_id id)
-{
- return "unknown";
-}
-
void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
{
debugfs_remove_recursive(rqos->debugfs_dir);
@@ -832,9 +827,6 @@ EXPORT_SYMBOL_GPL(blk_mq_debugfs_unregister_rqos);
void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
{
struct request_queue *q = rqos->q;
- const char *dir_name;
-
- dir_name = rqos->ops->name ? rqos->ops->name : rq_qos_id_to_name(rqos->id);

if (rqos->debugfs_dir || !rqos->ops->debugfs_attrs)
return;
@@ -843,7 +835,7 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
q->rqos_debugfs_dir = debugfs_create_dir("rqos",
q->debugfs_dir);

- rqos->debugfs_dir = debugfs_create_dir(dir_name,
+ rqos->debugfs_dir = debugfs_create_dir(rqos->ops->name,
rqos->q->rqos_debugfs_dir);

debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs);
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index ee396367a5b2..123b6b100355 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -13,10 +13,6 @@

struct blk_mq_debugfs_attr;

-enum rq_qos_id {
- RQ_QOS_UNUSED,
-};
-
struct rq_wait {
wait_queue_head_t wait;
atomic_t inflight;
@@ -28,7 +24,7 @@ struct rq_qos {
bool dying;
const struct rq_qos_ops *ops;
struct request_queue *q;
- enum rq_qos_id id;
+ int id;
struct rq_qos *next;
#ifdef CONFIG_BLK_DEBUG_FS
struct dentry *debugfs_dir;
@@ -89,52 +85,6 @@ static inline void rq_wait_init(struct rq_wait *rq_wait)
init_waitqueue_head(&rq_wait->wait);
}

-static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
-{
- /*
- * No IO can be in-flight when adding rqos, so freeze queue, which
- * is fine since we only support rq_qos for blk-mq queue.
- *
- * Reuse ->queue_lock for protecting against other concurrent
- * rq_qos adding/deleting
- */
- blk_mq_freeze_queue(q);
-
- spin_lock_irq(&q->queue_lock);
- rqos->next = q->rq_qos;
- q->rq_qos = rqos;
- spin_unlock_irq(&q->queue_lock);
-
- blk_mq_unfreeze_queue(q);
-
- if (rqos->ops->debugfs_attrs)
- blk_mq_debugfs_register_rqos(rqos);
-}
-
-static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
-{
- struct rq_qos **cur;
-
- /*
- * See comment in rq_qos_add() about freezing queue & using
- * ->queue_lock.
- */
- blk_mq_freeze_queue(q);
-
- spin_lock_irq(&q->queue_lock);
- for (cur = &q->rq_qos; *cur; cur = &(*cur)->next) {
- if (*cur == rqos) {
- *cur = rqos->next;
- break;
- }
- }
- spin_unlock_irq(&q->queue_lock);
-
- blk_mq_unfreeze_queue(q);
-
- blk_mq_debugfs_unregister_rqos(rqos);
-}
-
int rq_qos_register(struct rq_qos_ops *ops);
void rq_qos_unregister(struct rq_qos_ops *ops);
void rq_qos_activate(struct request_queue *q,
--
2.17.1


2022-01-10 09:14:01

by Wang Jianchao

Subject: [PATCH 12/13] blk: make request able to carry blkcg_gq

From: Wang Jianchao <[email protected]>

After blk_update_request(), the bios can be gone, so we cannot track
the request at cgroup granularity in the subsequent IO completion
path. This patch adds a blkcg_gq pointer to struct request; a
reference is taken when a bio is installed into the request and put
before the request is released.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 3 +++
block/blk-core.c | 6 +++++-
block/blk-merge.c | 9 +++++++++
block/blk-mq.c | 14 ++++++++++++++
include/linux/blk-mq.h | 4 +++-
5 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index b3a2c656a64b..ea612cb5c8ee 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -32,6 +32,9 @@ config BLK_BIO_IOCOST
config BLK_RQ_ALLOC_TIME
bool

+config BLK_RQ_BLKCG_GQ
+ bool
+
config BLK_CGROUP_RWSTAT
bool

diff --git a/block/blk-core.c b/block/blk-core.c
index 2847ab514c1f..083160895125 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1462,7 +1462,11 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
}
rq->nr_phys_segments = rq_src->nr_phys_segments;
rq->ioprio = rq_src->ioprio;
-
+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ if (rq_src->blkg)
+ blkg_get(rq_src->blkg);
+ rq->blkg = rq_src->blkg;
+#endif
if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
goto free_and_out;

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 893c1a60b701..cf5d0e5ce04f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -793,6 +793,10 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->ioprio != next->ioprio)
return NULL;

+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ if (req->blkg != next->blkg)
+ return NULL;
+#endif
/*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
@@ -930,6 +934,11 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (rq->ioprio != bio_prio(bio))
return false;

+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ if (rq->blkg != bio->bi_blkg)
+ return false;
+#endif
+
return true;
}

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8874a63ae952..131845bca5de 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -28,6 +28,7 @@
#include <linux/crash_dump.h>
#include <linux/prefetch.h>
#include <linux/blk-crypto.h>
+#include <linux/blk-cgroup.h>

#include <trace/events/block.h>

@@ -369,6 +370,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->nr_phys_segments = 0;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
rq->nr_integrity_segments = 0;
+#endif
+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ rq->blkg = NULL;
#endif
rq->end_io = NULL;
rq->end_io_data = NULL;
@@ -600,6 +604,10 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
const int sched_tag = rq->internal_tag;

+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ if (rq->blkg)
+ blkg_put(rq->blkg);
+#endif
blk_crypto_free_request(rq);
blk_pm_mark_last_busy(rq);
rq->mq_hctx = NULL;
@@ -2305,6 +2313,12 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
rq->__sector = bio->bi_iter.bi_sector;
rq->write_hint = bio->bi_write_hint;
blk_rq_bio_prep(rq, bio, nr_segs);
+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ if (bio->bi_blkg) {
+ blkg_get(bio->bi_blkg);
+ rq->blkg = bio->bi_blkg;
+ }
+#endif

/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */
err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2949d9ac7484..f9cc6f6b8d63 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -110,7 +110,9 @@ struct request {
u64 start_time_ns;
/* Time that I/O was submitted to the device. */
u64 io_start_time_ns;
-
+#ifdef CONFIG_BLK_RQ_BLKCG_GQ
+ struct blkcg_gq *blkg;
+#endif
#ifdef CONFIG_BLK_WBT
unsigned short wbt_flags;
#endif
--
2.17.1


2022-01-10 09:14:05

by Wang Jianchao

Subject: [PATCH 06/13] blk: remove unused BLK_RQ_IO_DATA_LEN

From: Wang Jianchao <[email protected]>

Remove it as nobody uses it any more.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 1 -
1 file changed, 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 1c0d05df2aec..50cc1b56852c 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -133,7 +133,6 @@ config BLK_CGROUP_FC_APPID
config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP
- select BLK_RQ_IO_DATA_LEN
select BLK_RQ_ALLOC_TIME
help
Enabling this option enables the .weight interface for cost
--
2.17.1


2022-01-10 09:14:08

by Wang Jianchao

Subject: [PATCH 09/13] blk: rename ioprio.c to ioprio-common.c

From: Wang Jianchao <[email protected]>

In the next patch, blk-ioprio.c becomes a module named ioprio.ko.
Rename ioprio.c to ioprio-common.c to avoid two targets both named
ioprio.o in the Makefile.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Makefile | 2 +-
block/{ioprio.c => ioprio-common.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename block/{ioprio.c => ioprio-common.c} (100%)

diff --git a/block/Makefile b/block/Makefile
index 8950913cbcc9..beacc3a03c8b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-exec.o blk-merge.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
- genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
+ genhd.o ioprio-common.o badblocks.o partitions/ blk-rq-qos.o \
disk-events.o blk-ia-ranges.o

obj-$(CONFIG_BOUNCE) += bounce.o
diff --git a/block/ioprio.c b/block/ioprio-common.c
similarity index 100%
rename from block/ioprio.c
rename to block/ioprio-common.c
--
2.17.1


2022-01-10 09:14:23

by Wang Jianchao

Subject: [PATCH 07/13] blk: use standalone macro to control bio.bi_iocost_cost

From: Wang Jianchao <[email protected]>

This is a preparation for making iocost modular.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 4 ++++
block/bio.c | 2 +-
include/linux/blk_types.h | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 50cc1b56852c..e1b1bff5c1e9 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -26,6 +26,9 @@ menuconfig BLOCK

if BLOCK

+config BLK_BIO_IOCOST
+ bool
+
config BLK_RQ_ALLOC_TIME
bool

@@ -134,6 +137,7 @@ config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP
select BLK_RQ_ALLOC_TIME
+ select BLK_BIO_IOCOST
help
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
diff --git a/block/bio.c b/block/bio.c
index 15ab0d6d1c06..a9e2347b0021 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -268,7 +268,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
#ifdef CONFIG_BLK_CGROUP
bio->bi_blkg = NULL;
bio->bi_issue.value = 0;
-#ifdef CONFIG_BLK_CGROUP_IOCOST
+#ifdef CONFIG_BLK_BIO_IOCOST
bio->bi_iocost_cost = 0;
#endif
#endif
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fe065c394fff..495ffc29bab0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -261,7 +261,7 @@ struct bio {
*/
struct blkcg_gq *bi_blkg;
struct bio_issue bi_issue;
-#ifdef CONFIG_BLK_CGROUP_IOCOST
+#ifdef CONFIG_BLK_BIO_IOCOST
u64 bi_iocost_cost;
#endif
#endif
--
2.17.1


2022-01-10 09:14:31

by Wang Jianchao

Subject: [PATCH 13/13] blk: introduce iostat per cgroup module

From: Wang Jianchao <[email protected]>

iostat can only track the whole device's IO statistics. This patch
introduces per-cgroup iostat based on the blk-rq-qos framework, which
can track bw, iops, queue latency and device latency, and distinguish
regular data from metadata. The per-cgroup blkio.iostat output is in
the following format,

  vda-data  bytes ios queue_lat dev_lat   [ditto]   [ditto]
      meta  \____________ ____________/      |         |
                         v                   v         v
                       read                write     discard

In particular, the root cgroup's blkio.iostat only outputs the
statistics of IOs from the root cgroup itself, while a non-root
blkio.iostat aggregates all of its children cgroups. With the meta
stats in the root cgroup, we hope to observe the performance of fs
metadata.
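
For instance, a hypothetical raw line (values invented):

    vda-data 8388608 512 4096000 51200000 ...

would mean reads on vda moved 8388608 bytes in 512 IOs, with 4096000ns
of cumulative queue latency and 51200000ns of cumulative device
latency; per-IO averages follow by dividing by the IO count, e.g.
51200000 / 512 = 100000ns = 100us average device latency. The
iostat-cgrp tool from the cover letter presumably derives its
per-second rates from the deltas between two such samples.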

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 9 ++
block/Makefile | 2 +
block/blk-iostat.c | 356 +++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 2 +-
4 files changed, 368 insertions(+), 1 deletion(-)
create mode 100644 block/blk-iostat.c

diff --git a/block/Kconfig b/block/Kconfig
index ea612cb5c8ee..35f24db3ec92 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -156,6 +156,15 @@ config BLK_CGROUP_IOPRIO
scheduler and block devices process requests. Only some I/O schedulers
and some block devices support I/O priorities.

+config BLK_CGROUP_IOSTAT
+ tristate "IO statistics monitor per cgroup"
+ select BLK_RQ_BLKCG_GQ
+ select BLK_RQ_ALLOC_TIME
+ depends on BLK_CGROUP
+ help
+ Monitor IO statistics, including bw, iops, queue latency and device
+ latency, in per-cgroup level.
+
config BLK_DEBUG_FS
bool "Block layer debugging information in debugfs"
default y
diff --git a/block/Makefile b/block/Makefile
index 3f76836076b2..ad89015e37ce 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -23,6 +23,8 @@ iolat-y := blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += iolat.o
iocost-y := blk-iocost.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += iocost.o
+iostat-y := blk-iostat.o
+obj-$(CONFIG_BLK_CGROUP_IOSTAT) += iostat.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
diff --git a/block/blk-iostat.c b/block/blk-iostat.c
new file mode 100644
index 000000000000..3c6bcb6ab055
--- /dev/null
+++ b/block/blk-iostat.c
@@ -0,0 +1,356 @@
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/module.h>
+#include <linux/blk-cgroup.h>
+#include <linux/bio.h>
+#include <linux/spinlock.h>
+
+#include "blk.h"
+#include "blk-rq-qos.h"
+
+enum {
+ IOSTAT_READ = 0,
+ IOSTAT_WRITE,
+ IOSTAT_DISCARD,
+ IOSTAT_MAX,
+};
+
+struct iostat_data {
+ u64 bytes[IOSTAT_MAX];
+ u64 ios[IOSTAT_MAX];
+ u64 queue_lat[IOSTAT_MAX];
+ u64 dev_lat[IOSTAT_MAX];
+};
+
+struct iostat_queue {
+ struct rq_qos rqos;
+};
+
+struct iostat_gq {
+ struct blkg_policy_data pd;
+ char disk_name[DISK_NAME_LEN];
+ struct {
+ struct iostat_data __percpu *data;
+ struct iostat_data __percpu *meta;
+ } stat;
+};
+
+struct iostat_cgrp {
+ struct blkcg_policy_data cpd;
+};
+
+
+static struct blkcg_policy blkcg_policy_iostat;
+
+static inline struct iostat_gq *pd_to_ist(struct blkg_policy_data *pd)
+{
+ return pd ? container_of(pd, struct iostat_gq, pd) : NULL;
+}
+
+static inline struct iostat_gq *blkg_to_ist(struct blkcg_gq *blkg)
+{
+ return pd_to_ist(blkg_to_pd(blkg, &blkcg_policy_iostat));
+}
+
+static inline bool req_is_meta(struct request *req)
+{
+ return req->cmd_flags & REQ_META;
+}
+
+static inline int iostat_op(struct request *req)
+{
+ int op;
+
+ if (unlikely(op_is_discard(req_op(req))))
+ op = IOSTAT_DISCARD;
+ else if (op_is_write(req_op(req)))
+ op = IOSTAT_WRITE;
+ else
+ op = IOSTAT_READ;
+
+ return op;
+}
+
+static void __iostat_issue(struct rq_qos *rqos,
+ struct iostat_gq *is, struct request *req)
+{
+ struct iostat_data *stat;
+ int op = iostat_op(req);
+
+ /*
+ * blk_mq_start_request() inherits bio_issue_time() when BLK_CGROUP
+ * is enabled, to avoid the overhead of another readtsc.
+ */
+ req->io_start_time_ns = ktime_get_ns();
+ if (req_is_meta(req))
+ stat = get_cpu_ptr(is->stat.meta);
+ else
+ stat = get_cpu_ptr(is->stat.data);
+ /*
+ * alloc_time_ns is recorded before getting a tag, so we use it to
+ * monitor queue depth, tag waits and time spent in the queue.
+ */
+ stat->queue_lat[op] += req->io_start_time_ns - req->alloc_time_ns;
+ stat->ios[op]++;
+ stat->bytes[op] += blk_rq_bytes(req);
+ put_cpu_ptr(stat);
+}
+
+static void iostat_issue(struct rq_qos *rqos, struct request *req)
+{
+ struct iostat_gq *is;
+
+ if (unlikely(!req->bio))
+ return;
+
+ is = blkg_to_ist(req->blkg);
+ /*
+ * Most of the time, bios from submit_bio() have a valid bi_blkg;
+ * the blk_execute_rq() path, however, is an exception.
+ */
+ if (is)
+ __iostat_issue(rqos, is, req);
+}
+
+static void __iostat_done(struct rq_qos *rq_qos,
+ struct iostat_gq *is, struct request *req)
+{
+ struct iostat_data *stat;
+ int op = iostat_op(req);
+
+ if (req_is_meta(req))
+ stat = get_cpu_ptr(is->stat.meta);
+ else
+ stat = get_cpu_ptr(is->stat.data);
+ if (req->io_start_time_ns)
+ stat->dev_lat[op] += ktime_get_ns() - req->io_start_time_ns;
+ put_cpu_ptr(stat);
+}
+
+static void iostat_done(struct rq_qos *rqos, struct request *req)
+{
+ struct iostat_gq *is = blkg_to_ist(req->blkg);
+
+ if (is)
+ __iostat_done(rqos, is, req);
+}
+
+static void iostat_exit(struct rq_qos *rqos)
+{
+ struct iostat_queue *isq = container_of(rqos, struct iostat_queue, rqos);
+
+ blkcg_deactivate_policy(rqos->q, &blkcg_policy_iostat);
+ rq_qos_deactivate(rqos);
+ kfree(isq);
+}
+
+static int iostat_init(struct request_queue *q);
+
+static struct rq_qos_ops iostat_rq_ops = {
+#if IS_MODULE(CONFIG_BLK_CGROUP_IOSTAT)
+ .owner = THIS_MODULE,
+#endif
+ .name = "iostat",
+ .flags = RQOS_FLAG_CGRP_POL | RQOS_FLAG_RQ_ALLOC_TIME,
+ .issue = iostat_issue,
+ .done = iostat_done,
+ .exit = iostat_exit,
+ .init = iostat_init,
+};
+
+static int iostat_init(struct request_queue *q)
+{
+ struct iostat_queue *isq;
+ struct rq_qos *rqos;
+ int ret;
+
+ isq = kzalloc_node(sizeof(*isq), GFP_KERNEL, q->node);
+ if (!isq) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, q);
+ rqos = &isq->rqos;
+ rq_qos_activate(q, rqos, &iostat_rq_ops);
+
+ ret = blkcg_activate_policy(q, &blkcg_policy_iostat);
+ if (ret) {
+ rq_qos_deactivate(rqos);
+ kfree(isq);
+ }
+out:
+ return ret;
+}
+
+static void iostat_sum(struct blkcg_gq *blkg,
+ struct iostat_data *sum, bool meta)
+{
+ struct iostat_gq *is = blkg_to_ist(blkg);
+ struct iostat_data *stat;
+ int cpu, i;
+
+ for_each_possible_cpu(cpu) {
+ if (meta)
+ stat = per_cpu_ptr(is->stat.meta, cpu);
+ else
+ stat = per_cpu_ptr(is->stat.data, cpu);
+ for (i = 0; i < IOSTAT_MAX; i++) {
+ sum->bytes[i] += stat->bytes[i];
+ sum->ios[i] += stat->ios[i];
+ sum->dev_lat[i] += stat->dev_lat[i];
+ sum->queue_lat[i] += stat->queue_lat[i];
+ }
+ }
+}
+
+static int iostat_show(struct seq_file *sf, void *v)
+{
+ struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+ struct cgroup_subsys_state *pos_css;
+ struct iostat_gq *is;
+ struct blkcg_gq *blkg, *pos_blkg;
+ struct iostat_data data_sum, meta_sum;
+ int i;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+ is = blkg_to_ist(blkg);
+ /*
+ * The policy is activated on demand, so the iostat pd may be NULL
+ */
+ if (!is)
+ continue;
+
+ memset(&data_sum, 0, sizeof(data_sum));
+ memset(&meta_sum, 0, sizeof(meta_sum));
+ if (blkg == blkg->q->root_blkg) {
+ iostat_sum(blkg, &data_sum, false);
+ iostat_sum(blkg, &meta_sum, true);
+ } else {
+ /*
+ * Iterate over every child blkg to aggregate the statistics
+ */
+ blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
+ if (!pos_blkg->online)
+ continue;
+ iostat_sum(pos_blkg, &data_sum, false);
+ iostat_sum(pos_blkg, &meta_sum, true);
+ }
+ }
+
+ seq_printf(sf, "%s-data ", is->disk_name);
+ for (i = 0; i < IOSTAT_MAX; i++)
+ seq_printf(sf, "%llu %llu %llu %llu ",
+ data_sum.bytes[i], data_sum.ios[i],
+ data_sum.queue_lat[i], data_sum.dev_lat[i]);
+ seq_printf(sf, "\n");
+ seq_printf(sf, "%s-meta ", is->disk_name);
+ for (i = 0; i < IOSTAT_MAX; i++)
+ seq_printf(sf, "%llu %llu %llu %llu ",
+ meta_sum.bytes[i], meta_sum.ios[i],
+ meta_sum.queue_lat[i], meta_sum.dev_lat[i]);
+ seq_printf(sf, "\n");
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+static struct cftype iostat_files[] = {
+ {
+ .name = "iostat",
+ .seq_show = iostat_show,
+ },
+ {}
+};
+
+static struct cftype iostat_legacy_files[] = {
+ {
+ .name = "iostat",
+ .seq_show = iostat_show,
+ },
+ {}
+};
+
+static void iostat_pd_free(struct blkg_policy_data *pd)
+{
+ struct iostat_gq *is = pd_to_ist(pd);
+
+ if (is->stat.data)
+ free_percpu(is->stat.data);
+
+ if (is->stat.meta)
+ free_percpu(is->stat.meta);
+
+ kfree(is);
+}
+
+static struct blkg_policy_data *iostat_pd_alloc(gfp_t gfp,
+ struct request_queue *q, struct blkcg *blkcg)
+{
+ struct iostat_gq *is;
+
+ is = kzalloc_node(sizeof(*is), gfp, q->node);
+ if (!is)
+ return NULL;
+
+ is->stat.data = __alloc_percpu_gfp(sizeof(struct iostat_data),
+ __alignof__(struct iostat_data), gfp);
+ if (!is->stat.data)
+ goto out_free;
+
+ is->stat.meta = __alloc_percpu_gfp(sizeof(struct iostat_data),
+ __alignof__(struct iostat_data), gfp);
+ if (!is->stat.meta)
+ goto out_free;
+ /*
+ * request_queue.kobj's parent is the gendisk, whose name is the disk name
+ */
+ strlcpy(is->disk_name,
+ kobject_name(q->kobj.parent),
+ DISK_NAME_LEN);
+ return &is->pd;
+out_free:
+ /* iostat_pd_free() frees any partially allocated percpu stats */
+ iostat_pd_free(&is->pd);
+ return NULL;
+}
+
+static struct blkcg_policy blkcg_policy_iostat = {
+ .dfl_cftypes = iostat_files,
+ .legacy_cftypes = iostat_legacy_files,
+ .pd_alloc_fn = iostat_pd_alloc,
+ .pd_free_fn = iostat_pd_free,
+};
+
+static int __init iostat_mod_init(void)
+{
+ int ret;
+
+ ret = rq_qos_register(&iostat_rq_ops);
+ if (ret)
+ return ret;
+
+ ret = blkcg_policy_register(&blkcg_policy_iostat);
+ if (ret) {
+ rq_qos_unregister(&iostat_rq_ops);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void __exit iostat_mod_exit(void)
+{
+ rq_qos_unregister(&iostat_rq_ops);
+ blkcg_policy_unregister(&blkcg_policy_iostat);
+}
+
+module_init(iostat_mod_init);
+module_exit(iostat_mod_exit);
+MODULE_AUTHOR("Wang Jianchao");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Block Statistics per Cgroup");
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed30b3c3fee7..75026cf54384 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -42,7 +42,7 @@ struct blk_crypto_profile;
* Maximum number of blkcg policies allowed to be registered concurrently.
* Defined here to simplify include dependency.
*/
-#define BLKCG_MAX_POLS 6
+#define BLKCG_MAX_POLS 7
/*
* Non blk-rq-qos blkcg policies include blk-throttle and bfq
*/
--
2.17.1
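
The counters that iostat_show() emits above are cumulative, so a consumer such
as the iostat-cgrp tool has to sample blkio.iostat twice and derive rates from
the deltas; average latencies come from dividing the cumulative latency delta
by the IO count delta. A minimal userspace sketch of that arithmetic follows
(iostat_rates is a hypothetical helper; the latency unit is assumed to be
nanoseconds):

struct iostat_sample {
        unsigned long long bytes;
        unsigned long long ios;
        unsigned long long queue_lat;   /* cumulative, unit assumed ns */
        unsigned long long dev_lat;     /* cumulative, unit assumed ns */
};

/* Derive bw/iops/avg-latency from two cumulative samples taken dt seconds apart */
static void iostat_rates(const struct iostat_sample *prev,
                         const struct iostat_sample *cur, double dt,
                         double *bw, double *iops, double *qlat, double *dlat)
{
        unsigned long long ios = cur->ios - prev->ios;

        *bw   = (cur->bytes - prev->bytes) / dt;
        *iops = ios / dt;
        /* average per-IO latency over the interval */
        *qlat = ios ? (double)(cur->queue_lat - prev->queue_lat) / ios : 0.0;
        *dlat = ios ? (double)(cur->dev_lat - prev->dev_lat) / ios : 0.0;
}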


2022-01-10 09:14:47

by Wang Jianchao

[permalink] [raw]
Subject: [PATCH 05/13] blk-iolatency: make iolatency pluggable and modular

From: Wang Jianchao <[email protected]>

Make blk-iolatency pluggable and modular. Then we can enable or disable
it through /sys/block/xxx/queue/qos, and rmmod the module when we don't
need it, which releases one blkcg policy slot.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/Kconfig | 2 +-
block/Makefile | 4 +++-
block/blk-cgroup.c | 6 ------
block/blk-iolatency.c | 39 +++++++++++++++++++++++++++++++--------
block/blk-mq-debugfs.c | 2 --
block/blk-rq-qos.h | 6 ------
block/blk.h | 6 ------
7 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index c6ce41a5e5b2..1c0d05df2aec 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -111,7 +111,7 @@ config BLK_WBT_MQ
Enable writeback throttling by default for request-based block devices.

config BLK_CGROUP_IOLATENCY
- bool "Enable support for latency based cgroup IO protection"
+ tristate "Enable support for latency based cgroup IO protection"
depends on BLK_CGROUP
help
Enabling this option enables the .latency interface for IO throttling.
diff --git a/block/Makefile b/block/Makefile
index 44df57e562bf..ccf61c57e1d4 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,7 +18,9 @@ obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_CGROUP_RWSTAT) += blk-cgroup-rwstat.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOPRIO) += blk-ioprio.o
-obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
+iolat-y := blk-iolatency.o
+obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += iolat.o
+
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fb56d74f1c8e..fd874dfd38ed 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1212,12 +1212,6 @@ int blkcg_init_queue(struct request_queue *q)
if (ret)
goto err_destroy_all;

- ret = blk_iolatency_init(q);
- if (ret) {
- blk_throtl_exit(q);
- goto err_destroy_all;
- }
-
return 0;

err_destroy_all:
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 6593c7123b97..6aaf0775e484 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -90,6 +90,12 @@ struct blk_iolatency {
atomic_t enabled;
};

+static struct rq_qos_ops blkcg_iolatency_ops;
+static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
+{
+ return rq_qos_by_id(q, blkcg_iolatency_ops.id);
+}
+
static inline struct blk_iolatency *BLKIOLATENCY(struct rq_qos *rqos)
{
return container_of(rqos, struct blk_iolatency, rqos);
@@ -646,13 +652,21 @@ static void blkcg_iolatency_exit(struct rq_qos *rqos)

del_timer_sync(&blkiolat->timer);
blkcg_deactivate_policy(rqos->q, &blkcg_policy_iolatency);
+ rq_qos_deactivate(rqos);
kfree(blkiolat);
}

+static int blk_iolatency_init(struct request_queue *q);
static struct rq_qos_ops blkcg_iolatency_ops = {
+#if IS_MODULE(CONFIG_BLK_CGROUP_IOLATENCY)
+ .owner = THIS_MODULE,
+#endif
+ .name = "iolat",
+ .flags = RQOS_FLAG_CGRP_POL,
.throttle = blkcg_iolatency_throttle,
.done_bio = blkcg_iolatency_done_bio,
.exit = blkcg_iolatency_exit,
+ .init = blk_iolatency_init,
};

static void blkiolatency_timer_fn(struct timer_list *t)
@@ -727,15 +741,10 @@ int blk_iolatency_init(struct request_queue *q)
return -ENOMEM;

rqos = &blkiolat->rqos;
- rqos->id = RQ_QOS_LATENCY;
- rqos->ops = &blkcg_iolatency_ops;
- rqos->q = q;
-
- rq_qos_add(q, rqos);
-
+ rq_qos_activate(q, rqos, &blkcg_iolatency_ops);
ret = blkcg_activate_policy(q, &blkcg_policy_iolatency);
if (ret) {
- rq_qos_del(q, rqos);
+ rq_qos_deactivate(rqos);
kfree(blkiolat);
return ret;
}
@@ -1046,13 +1055,27 @@ static struct blkcg_policy blkcg_policy_iolatency = {

static int __init iolatency_init(void)
{
- return blkcg_policy_register(&blkcg_policy_iolatency);
+ int ret;
+
+ ret = rq_qos_register(&blkcg_iolatency_ops);
+ if (ret)
+ return ret;
+
+ ret = blkcg_policy_register(&blkcg_policy_iolatency);
+ if (ret)
+ rq_qos_unregister(&blkcg_iolatency_ops);
+
+ return ret;
}

static void __exit iolatency_exit(void)
{
blkcg_policy_unregister(&blkcg_policy_iolatency);
+ rq_qos_unregister(&blkcg_iolatency_ops);
}

module_init(iolatency_init);
module_exit(iolatency_exit);
+MODULE_AUTHOR("Josef Bacik");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Latency based cgroup IO protection");
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9c786b63c847..57c33f4730f2 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -820,8 +820,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
static const char *rq_qos_id_to_name(enum rq_qos_id id)
{
switch (id) {
- case RQ_QOS_LATENCY:
- return "latency";
case RQ_QOS_COST:
return "cost";
case RQ_QOS_IOPRIO:
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index de82eb951bdd..6ca46c69e325 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,7 +14,6 @@
struct blk_mq_debugfs_attr;

enum rq_qos_id {
- RQ_QOS_LATENCY,
RQ_QOS_COST,
RQ_QOS_IOPRIO,
};
@@ -85,11 +84,6 @@ static inline struct rq_qos *rq_qos_by_id(struct request_queue *q, int id)
return rqos;
}

-static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
-{
- return rq_qos_by_id(q, RQ_QOS_LATENCY);
-}
-
static inline void rq_wait_init(struct rq_wait *rq_wait)
{
atomic_set(&rq_wait->inflight, 0);
diff --git a/block/blk.h b/block/blk.h
index ccde6e6f1736..e2e4fbb9a58d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -436,12 +436,6 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
__blk_queue_bounce(q, bio);
}

-#ifdef CONFIG_BLK_CGROUP_IOLATENCY
-extern int blk_iolatency_init(struct request_queue *q);
-#else
-static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
-#endif
-
struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);

#ifdef CONFIG_BLK_DEV_ZONED
--
2.17.1
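
Taken together, the hooks the patch above wires up sketch out the lifecycle of
a modular rq-qos policy: register the ops at module load, let the framework
call .init when the policy is switched on for a queue, and tear down via
.exit. Below is a minimal skeleton inferred from the call sites in this
series; "demo" is a hypothetical policy, and the exact rq_qos_ops layout comes
from patch 01, so details may differ:

#include <linux/module.h>
#include <linux/slab.h>
#include "blk-rq-qos.h"

struct demo_qos {
        struct rq_qos rqos;
};

static int demo_init(struct request_queue *q);

static void demo_exit(struct rq_qos *rqos)
{
        /* unhook from the queue, then free the per-queue instance */
        rq_qos_deactivate(rqos);
        kfree(container_of(rqos, struct demo_qos, rqos));
}

static struct rq_qos_ops demo_ops = {
        .owner  = THIS_MODULE,
        .name   = "demo",
        .init   = demo_init,    /* invoked when "demo" is enabled on a queue */
        .exit   = demo_exit,    /* invoked on disable or queue teardown */
};

static int demo_init(struct request_queue *q)
{
        struct demo_qos *demo = kzalloc(sizeof(*demo), GFP_KERNEL);

        if (!demo)
                return -ENOMEM;
        rq_qos_activate(q, &demo->rqos, &demo_ops);
        return 0;
}

static int __init demo_mod_init(void)
{
        return rq_qos_register(&demo_ops);
}

static void __exit demo_mod_exit(void)
{
        rq_qos_unregister(&demo_ops);
}

module_init(demo_mod_init);
module_exit(demo_mod_exit);
MODULE_LICENSE("GPL");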


2022-01-10 09:14:47

by Wang Jianchao

[permalink] [raw]
Subject: [PATCH 04/13] cgroup: export following two interfaces

From: Wang Jianchao <[email protected]>

This is a preparation for making blk-rq-qos modular. There is no
functional change; it just exports the interfaces pr_cont_cgroup_path
and cgroup_parse_float.

Signed-off-by: Wang Jianchao <[email protected]>
---
include/linux/cgroup.h | 5 +----
kernel/cgroup/cgroup.c | 7 +++++++
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c151413fda..1a67b0db00db 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -666,10 +666,7 @@ static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
pr_cont_kernfs_name(cgrp->kn);
}

-static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
-{
- pr_cont_kernfs_path(cgrp->kn);
-}
+void pr_cont_cgroup_path(struct cgroup *cgrp);

static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 919194de39c8..f358d5122033 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6597,6 +6597,13 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
*v = whole * power_of_ten(dec_shift) + frac;
return 0;
}
+EXPORT_SYMBOL_GPL(cgroup_parse_float);
+
+void pr_cont_cgroup_path(struct cgroup *cgrp)
+{
+ pr_cont_kernfs_path(cgrp->kn);
+}
+EXPORT_SYMBOL_GPL(pr_cont_cgroup_path);

/*
* sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
--
2.17.1
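
cgroup_parse_float() turns a decimal string into a fixed-point integer: with
dec_shift fractional digits, "1.25" at dec_shift=2 parses to 125, i.e. the
value scaled by 10^2. A short sketch of how a modular rq-qos policy might
consume the newly exported symbol (parse_percent is a hypothetical helper):

#include <linux/cgroup.h>

/* Parse "NN.NN" into an s64 scaled by 100, rejecting values over 100% */
static int parse_percent(const char *input, s64 *out)
{
        s64 v;
        int ret;

        ret = cgroup_parse_float(input, 2, &v); /* two fractional digits */
        if (ret)
                return ret;
        if (v < 0 || v > 100 * 100)             /* 0.00 .. 100.00 */
                return -EINVAL;
        *out = v;
        return 0;
}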


2022-01-10 09:14:51

by Wang Jianchao

[permalink] [raw]
Subject: [PATCH 03/13] blk: export following interfaces

From: Wang Jianchao <[email protected]>

This is a preparation for making blk-rq-qos policies modular;
there is no functional change.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/bdev.c | 5 -----
block/blk-cgroup.c | 12 ++++++++++++
block/blk-mq-debugfs.c | 2 ++
block/blk-rq-qos.c | 2 ++
block/blk-stat.c | 30 ------------------------------
block/blk-stat.h | 31 ++++++++++++++++++++++++++++---
include/linux/blk-cgroup.h | 1 +
include/linux/blkdev.h | 5 ++++-
8 files changed, 49 insertions(+), 39 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index b1d087e5e205..35d8c71be741 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -761,11 +761,6 @@ struct block_device *blkdev_get_no_open(dev_t dev)
return bdev;
}

-void blkdev_put_no_open(struct block_device *bdev)
-{
- put_device(&bdev->bd_device);
-}
-
/**
* blkdev_get_by_dev - open a block device by device number
* @dev: device number of block device to open
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 663aabfeba18..fb56d74f1c8e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -55,10 +55,18 @@ static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
static LIST_HEAD(all_blkcgs); /* protected by blkcg_pol_mutex */

bool blkcg_debug_stats = false;
+EXPORT_SYMBOL_GPL(blkcg_debug_stats);
+
static struct workqueue_struct *blkcg_punt_bio_wq;

#define BLKG_DESTROY_BATCH_SIZE 64

+bool blkcg_debug_stats_enabled(void)
+{
+ return blkcg_debug_stats;
+}
+EXPORT_SYMBOL_GPL(blkcg_debug_stats_enabled);
+
static bool blkcg_policy_enabled(struct request_queue *q,
const struct blkcg_policy *pol)
{
@@ -494,6 +502,7 @@ const char *blkg_dev_name(struct blkcg_gq *blkg)
return NULL;
return bdi_dev_name(blkg->q->disk->bdi);
}
+EXPORT_SYMBOL_GPL(blkg_dev_name);

/**
* blkcg_print_blkgs - helper for printing per-blkg data
@@ -606,6 +615,7 @@ struct block_device *blkcg_conf_open_bdev(char **inputp)
*inputp = input;
return bdev;
}
+EXPORT_SYMBOL_GPL(blkcg_conf_open_bdev);

/**
* blkg_conf_prep - parse and prepare for per-blkg config update
@@ -1778,6 +1788,7 @@ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
current->use_memdelay = use_memdelay;
set_notify_resume(current);
}
+EXPORT_SYMBOL_GPL(blkcg_schedule_throttle);

/**
* blkcg_add_delay - add delay to this blkg
@@ -1795,6 +1806,7 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
blkcg_scale_delay(blkg, now);
atomic64_add(delta, &blkg->delay_nsec);
}
+EXPORT_SYMBOL_GPL(blkcg_add_delay);

/**
* blkg_tryget_closest - try and get a blkg ref on the closet blkg
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e225db3c271f..9c786b63c847 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -835,6 +835,7 @@ void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
debugfs_remove_recursive(rqos->debugfs_dir);
rqos->debugfs_dir = NULL;
}
+EXPORT_SYMBOL_GPL(blk_mq_debugfs_unregister_rqos);

void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
{
@@ -855,6 +856,7 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)

debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs);
}
+EXPORT_SYMBOL_GPL(blk_mq_debugfs_register_rqos);

void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
{
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index a94ff872722b..08ccd4a4e913 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -33,6 +33,7 @@ bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
{
return atomic_inc_below(&rq_wait->inflight, limit);
}
+EXPORT_SYMBOL_GPL(rq_wait_inc_below);

void __rq_qos_cleanup(struct rq_qos *rqos, struct bio *bio)
{
@@ -296,6 +297,7 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
} while (1);
finish_wait(&rqw->wait, &data.wq);
}
+EXPORT_SYMBOL_GPL(rq_qos_wait);

void rq_qos_exit(struct request_queue *q)
{
diff --git a/block/blk-stat.c b/block/blk-stat.c
index ae3dd1fb8e61..2b0c530b0c9c 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -18,36 +18,6 @@ struct blk_queue_stats {
bool enable_accounting;
};

-void blk_rq_stat_init(struct blk_rq_stat *stat)
-{
- stat->min = -1ULL;
- stat->max = stat->nr_samples = stat->mean = 0;
- stat->batch = 0;
-}
-
-/* src is a per-cpu stat, mean isn't initialized */
-void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
-{
- if (!src->nr_samples)
- return;
-
- dst->min = min(dst->min, src->min);
- dst->max = max(dst->max, src->max);
-
- dst->mean = div_u64(src->batch + dst->mean * dst->nr_samples,
- dst->nr_samples + src->nr_samples);
-
- dst->nr_samples += src->nr_samples;
-}
-
-void blk_rq_stat_add(struct blk_rq_stat *stat, u64 value)
-{
- stat->min = min(stat->min, value);
- stat->max = max(stat->max, value);
- stat->batch += value;
- stat->nr_samples++;
-}
-
void blk_stat_add(struct request *rq, u64 now)
{
struct request_queue *q = rq->q;
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 17b47a86eefb..2642969594d8 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -164,8 +164,33 @@ static inline void blk_stat_activate_msecs(struct blk_stat_callback *cb,
mod_timer(&cb->timer, jiffies + msecs_to_jiffies(msecs));
}

-void blk_rq_stat_add(struct blk_rq_stat *, u64);
-void blk_rq_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
-void blk_rq_stat_init(struct blk_rq_stat *);
+static inline void blk_rq_stat_init(struct blk_rq_stat *stat)
+{
+ stat->min = -1ULL;
+ stat->max = stat->nr_samples = stat->mean = 0;
+ stat->batch = 0;
+}
+
+/* src is a per-cpu stat, mean isn't initialized */
+static inline void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+{
+ if (!src->nr_samples)
+ return;
+
+ dst->min = min(dst->min, src->min);
+ dst->max = max(dst->max, src->max);
+
+ dst->mean = div_u64(src->batch + dst->mean * dst->nr_samples,
+ dst->nr_samples + src->nr_samples);
+
+ dst->nr_samples += src->nr_samples;
+}
+
+static inline void blk_rq_stat_add(struct blk_rq_stat *stat, u64 value)
+{
+ stat->min = min(stat->min, value);
+ stat->max = max(stat->max, value);
+ stat->batch += value;
+ stat->nr_samples++;
+}
#endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index b4de2010fba5..b87a1bdde675 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -179,6 +179,7 @@ struct blkcg_policy {
extern struct blkcg blkcg_root;
extern struct cgroup_subsys_state * const blkcg_root_css;
extern bool blkcg_debug_stats;
+bool blkcg_debug_stats_enabled(void);

struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
struct request_queue *q, bool update_hint);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7dce2232814..ed30b3c3fee7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1297,7 +1297,10 @@ void blkdev_put(struct block_device *bdev, fmode_t mode);

/* just for blk-cgroup, don't use elsewhere */
struct block_device *blkdev_get_no_open(dev_t dev);
-void blkdev_put_no_open(struct block_device *bdev);
+static inline void blkdev_put_no_open(struct block_device *bdev)
+{
+ put_device(&bdev->bd_device);
+}

struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
void bdev_add(struct block_device *bdev, dev_t dev);
--
2.17.1
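
The mean merge in the now-inline blk_rq_stat_sum() works because src is a
per-cpu stat whose mean was never computed: src->batch is the raw sum of its
nr_samples values, so the combined mean is (src sum + dst mean x dst samples)
divided by the total sample count. A worked check with made-up numbers:

#include <assert.h>

int main(void)
{
        unsigned long long dst_mean = 10, dst_n = 2;    /* two samples, sum 20 */
        unsigned long long src_batch = 40, src_n = 2;   /* two samples, sum 40 */

        /* same formula as blk_rq_stat_sum(), with div_u64 as plain division */
        unsigned long long merged = (src_batch + dst_mean * dst_n) / (dst_n + src_n);

        assert(merged == 15);   /* (20 + 40) / 4 */
        return 0;
}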


2022-01-10 09:15:01

by Wang Jianchao

[permalink] [raw]
Subject: [PATCH 02/13] blk-wbt: make wbt pluggable

From: Wang Jianchao <[email protected]>

This patch makes wbt pluggable through /sys/block/xxx/queue/qos.

Signed-off-by: Wang Jianchao <[email protected]>
---
block/blk-mq-debugfs.c | 2 --
block/blk-rq-qos.h | 8 ++------
block/blk-sysfs.c | 7 ++-----
block/blk-wbt.c | 30 +++++++++++++++++++++++++-----
block/blk-wbt.h | 8 ++++----
5 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e3e8d54c836f..e225db3c271f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -820,8 +820,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
static const char *rq_qos_id_to_name(enum rq_qos_id id)
{
switch (id) {
- case RQ_QOS_WBT:
- return "wbt";
case RQ_QOS_LATENCY:
return "latency";
case RQ_QOS_COST:
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index c2b9b41f8fd4..de82eb951bdd 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,7 +14,6 @@
struct blk_mq_debugfs_attr;

enum rq_qos_id {
- RQ_QOS_WBT,
RQ_QOS_LATENCY,
RQ_QOS_COST,
RQ_QOS_IOPRIO,
@@ -86,11 +85,6 @@ static inline struct rq_qos *rq_qos_by_id(struct request_queue *q, int id)
return rqos;
}

-static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
-{
- return rq_qos_by_id(q, RQ_QOS_WBT);
-}
-
static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
{
return rq_qos_by_id(q, RQ_QOS_LATENCY);
@@ -158,6 +152,8 @@ ssize_t queue_qos_store(struct request_queue *q, const char *page,
size_t count);
struct rq_qos *rq_qos_get(struct request_queue *q, int id);
void rq_qos_put(struct rq_qos *rqos);
+int rq_qos_switch(struct request_queue *q, const struct rq_qos_ops *ops,
+ struct rq_qos *rqos);

static inline struct rq_qos *rq_qos_by_name(struct request_queue *q,
const char *name)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 91f980985b1b..12399e491670 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -482,11 +482,8 @@ static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
return -EINVAL;

rqos = wbt_rq_qos(q);
- if (!rqos) {
- ret = wbt_init(q);
- if (ret)
- return ret;
- }
+ if (!rqos)
+ return -EOPNOTSUPP;

if (val == -1)
val = wbt_default_latency_nsec(q);
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 88265ae4fa41..ce4b41e50564 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -31,6 +31,13 @@
#define CREATE_TRACE_POINTS
#include <trace/events/wbt.h>

+static struct rq_qos_ops wbt_rqos_ops;
+
+struct rq_qos *wbt_rq_qos(struct request_queue *q)
+{
+ return rq_qos_by_id(q, wbt_rqos_ops.id);
+}
+
static inline void wbt_clear_state(struct request *rq)
{
rq->wbt_flags = 0;
@@ -656,7 +663,7 @@ void wbt_enable_default(struct request_queue *q)
return;

if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ))
- wbt_init(q);
+ rq_qos_switch(q, &wbt_rqos_ops, NULL);
}
EXPORT_SYMBOL_GPL(wbt_enable_default);

@@ -696,6 +703,7 @@ static void wbt_exit(struct rq_qos *rqos)
struct rq_wb *rwb = RQWB(rqos);
struct request_queue *q = rqos->q;

+ rq_qos_deactivate(rqos);
blk_stat_remove_callback(q, rwb->cb);
blk_stat_free_callback(rwb->cb);
kfree(rwb);
@@ -806,7 +814,9 @@ static const struct blk_mq_debugfs_attr wbt_debugfs_attrs[] = {
};
#endif

+int wbt_init(struct request_queue *q);
static struct rq_qos_ops wbt_rqos_ops = {
+ .name = "wbt",
.throttle = wbt_wait,
.issue = wbt_issue,
.track = wbt_track,
@@ -815,6 +825,7 @@ static struct rq_qos_ops wbt_rqos_ops = {
.cleanup = wbt_cleanup,
.queue_depth_changed = wbt_queue_depth_changed,
.exit = wbt_exit,
+ .init = wbt_init,
#ifdef CONFIG_BLK_DEBUG_FS
.debugfs_attrs = wbt_debugfs_attrs,
#endif
@@ -838,9 +849,6 @@ int wbt_init(struct request_queue *q)
for (i = 0; i < WBT_NUM_RWQ; i++)
rq_wait_init(&rwb->rq_wait[i]);

- rwb->rqos.id = RQ_QOS_WBT;
- rwb->rqos.ops = &wbt_rqos_ops;
- rwb->rqos.q = q;
rwb->last_comp = rwb->last_issue = jiffies;
rwb->win_nsec = RWB_WINDOW_NSEC;
rwb->enable_state = WBT_STATE_ON_DEFAULT;
@@ -850,7 +858,7 @@ int wbt_init(struct request_queue *q)
/*
* Assign rwb and add the stats callback.
*/
- rq_qos_add(q, &rwb->rqos);
+ rq_qos_activate(q, &rwb->rqos, &wbt_rqos_ops);
blk_stat_add_callback(q, rwb->cb);

rwb->min_lat_nsec = wbt_default_latency_nsec(q);
@@ -860,3 +868,15 @@ int wbt_init(struct request_queue *q)

return 0;
}
+
+static __init int wbt_mod_init(void)
+{
+ return rq_qos_register(&wbt_rqos_ops);
+}
+
+static __exit void wbt_mod_exit(void)
+{
+ rq_qos_unregister(&wbt_rqos_ops);
+}
+module_init(wbt_mod_init);
+module_exit(wbt_mod_exit);
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 2eb01becde8c..72e9602df330 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -88,7 +88,7 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)

#ifdef CONFIG_BLK_WBT

-int wbt_init(struct request_queue *);
+struct rq_qos *wbt_rq_qos(struct request_queue *q);
void wbt_disable_default(struct request_queue *);
void wbt_enable_default(struct request_queue *);

@@ -101,12 +101,12 @@ u64 wbt_default_latency_nsec(struct request_queue *);

#else

-static inline void wbt_track(struct request *rq, enum wbt_flags flags)
+static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
{
+ return NULL;
}
-static inline int wbt_init(struct request_queue *q)
+static inline void wbt_track(struct request *rq, enum wbt_flags flags)
{
- return -EINVAL;
}
static inline void wbt_disable_default(struct request_queue *q)
{
--
2.17.1
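
With the patch above, wbt_enable_default() no longer calls wbt_init()
directly; it asks the framework to switch the "wbt" policy in via
rq_qos_switch(), whose implementation lives in patch 01 and is not shown
here. Roughly, it is expected to behave like the sketch below, inferred only
from the call sites in this patch; the third argument and policy replacement
are ignored in this simplification:

/* Simplified view of rq_qos_switch(q, ops, NULL), inferred from callers */
static int rq_qos_switch_sketch(struct request_queue *q,
                                const struct rq_qos_ops *ops)
{
        if (rq_qos_by_id(q, ops->id))   /* policy already active on this queue */
                return 0;
        /* .init allocates the instance and calls rq_qos_activate(), cf. wbt_init() */
        return ops->init(q);
}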


2022-01-10 17:36:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/13] blk: make blk-rq-qos policies pluggable and modular

On Mon, Jan 10, 2022 at 05:10:33PM +0800, Wang Jianchao wrote:
> This patchset attempts to make blk-rq-qos framework pluggable and modular.

I really don't think making these policies modular is a good thing, and
your new exports/APIs are a very good sign of why it is not a good
idea.

2022-01-11 01:53:43

by Wang Jianchao

[permalink] [raw]
Subject: Re: [PATCH 0/13] blk: make blk-rq-qos policies pluggable and modular



On 2022/1/11 1:36 AM, Christoph Hellwig wrote:
> On Mon, Jan 10, 2022 at 05:10:33PM +0800, Wang Jianchao wrote:
>> This patchset attempts to make blk-rq-qos framework pluggable and modular.
>
> I really don't think making these policies modular is a good thing, and
> your new exports/APIs are a very good sign of why it is not a good
> idea.

Actually, before sending out this version, I didn't make them modular but just
pluggable, and yes, that was because I had to export those interfaces. However,
when I made the patch that introduces a policy supporting cgroups and had to
increase BLKCG_MAX_POLS, it seemed worthwhile to make the previous policies
modular, since we can release the blkcg slot when a policy module is not
installed.

In addition, our own kernel uses the policies as modules. When we recommend
iocost to our customers, they are glad that we don't need to reboot the machine,
or even stop the IO workload, when iocost needs to be upgraded.

Thanks
Jianchao

2022-01-11 03:26:06

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 0/13] blk: make blk-rq-qos policies pluggable and modular

On 1/10/22 09:36, Christoph Hellwig wrote:
> On Mon, Jan 10, 2022 at 05:10:33PM +0800, Wang Jianchao wrote:
>> This patchset attempts to make blk-rq-qos framework pluggable and modular.
>
> I really don't think making these policies modular is a good thing, and
> your new exports/APIs are a very good sign of why it is not a good
> idea.

Hi Christoph,

Personally I don't need the ability to implement blk-rq-qos
functionality as a loadable kernel module.

When I implemented the ioprio rq-qos policy (see also blk-ioprio.c) I
noticed that I had to make changes in the block layer core
(blkcg_init_queue(), rq_qos_id_to_name(), blk-rq-qos.h) instead of
having all code related to the new rq-qos policy contained in a single
file. I think it would be an improvement if new rq-qos policies could be
implemented in a single source file and no block layer core changes
would be necessary.

Thanks,

Bart.



2022-01-12 20:15:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 13/13] blk: introduce iostat per cgroup module

On Mon, Jan 10, 2022 at 05:10:46PM +0800, Wang Jianchao wrote:
> From: Wang Jianchao <[email protected]>
>
> iostat can only track the whole device's io statistics. This patch
> introduces iostat per cgroup based on blk-rq-qos framework which
> can track bw, iops, queue latency and device latency and distinguish
> regular or meta data. The per-cgroup blkio.iostat is output in the
> following format, where the field group repeats for read, write and
> discard:
> vda-data bytes iops queue_lat dev_lat [ditto] [ditto]
> vda-meta bytes iops queue_lat dev_lat [ditto] [ditto]
> In particular, the blkio.iostat of the root cgroup only outputs the
> statistics of IOs from the root cgroup itself. However, a non-root
> blkio.iostat includes all of its children cgroups. With meta stats in
> the root cgroup, we hope to observe the performance of fs metadata.

I think using bpf is a way better solution for this kind of detailed
statistics. What if I want to know what portions are random, or the
distribution of IO sizes? Do I add another rq-qos policy or add another
interface file with interface versioning?

Thanks.

--
tejun

2022-01-13 01:49:35

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 01/13] blk: make blk-rq-qos support pluggable and modular policy

Hi Wang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on v5.16]
[cannot apply to axboe-block/for-next next-20220112]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Wang-Jianchao/blk-make-blk-rq-qos-policies-pluggable-and-modular/20220110-171347
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: arm-randconfig-r006-20220112 (https://download.01.org/0day-ci/archive/20220113/[email protected]/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 244dd2913a43a200f5a6544d424cdc37b771028b)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/8bef9fba59d8d47ecaebbeff3e62ee550d89b017
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Wang-Jianchao/blk-make-blk-rq-qos-policies-pluggable-and-modular/20220110-171347
git checkout 8bef9fba59d8d47ecaebbeff3e62ee550d89b017
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

block/blk-iocost.c:1244:6: warning: variable 'last_period' set but not used [-Wunused-but-set-variable]
u64 last_period, cur_period;
^
>> block/blk-iocost.c:3348:7: warning: variable 'ioc' is uninitialized when used here [-Wuninitialized]
if (!ioc) {
^~~
block/blk-iocost.c:3337:17: note: initialize the variable 'ioc' to silence this warning
struct ioc *ioc;
^
= NULL
2 warnings generated.


vim +/ioc +3348 block/blk-iocost.c

7caa47151ab2e64 Tejun Heo 2019-08-28 3331
7caa47151ab2e64 Tejun Heo 2019-08-28 3332 static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
7caa47151ab2e64 Tejun Heo 2019-08-28 3333 size_t nbytes, loff_t off)
7caa47151ab2e64 Tejun Heo 2019-08-28 3334 {
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3335 struct block_device *bdev;
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3336 struct rq_qos *rqos;
7caa47151ab2e64 Tejun Heo 2019-08-28 3337 struct ioc *ioc;
7caa47151ab2e64 Tejun Heo 2019-08-28 3338 u64 u[NR_I_LCOEFS];
7caa47151ab2e64 Tejun Heo 2019-08-28 3339 bool user;
7caa47151ab2e64 Tejun Heo 2019-08-28 3340 char *p;
7caa47151ab2e64 Tejun Heo 2019-08-28 3341 int ret;
7caa47151ab2e64 Tejun Heo 2019-08-28 3342
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3343 bdev = blkcg_conf_open_bdev(&input);
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3344 if (IS_ERR(bdev))
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3345 return PTR_ERR(bdev);
7caa47151ab2e64 Tejun Heo 2019-08-28 3346
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3347 rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
7caa47151ab2e64 Tejun Heo 2019-08-28 @3348 if (!ioc) {
ed6cddefdfd361a Pavel Begunkov 2021-10-14 3349 ret = blk_iocost_init(bdev_get_queue(bdev));
7caa47151ab2e64 Tejun Heo 2019-08-28 3350 if (ret)
7caa47151ab2e64 Tejun Heo 2019-08-28 3351 goto err;
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3352 rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
7caa47151ab2e64 Tejun Heo 2019-08-28 3353 }
7caa47151ab2e64 Tejun Heo 2019-08-28 3354
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3355 ioc = rqos_to_ioc(rqos);
7caa47151ab2e64 Tejun Heo 2019-08-28 3356 spin_lock_irq(&ioc->lock);
7caa47151ab2e64 Tejun Heo 2019-08-28 3357 memcpy(u, ioc->params.i_lcoefs, sizeof(u));
7caa47151ab2e64 Tejun Heo 2019-08-28 3358 user = ioc->user_cost_model;
7caa47151ab2e64 Tejun Heo 2019-08-28 3359 spin_unlock_irq(&ioc->lock);
7caa47151ab2e64 Tejun Heo 2019-08-28 3360
7caa47151ab2e64 Tejun Heo 2019-08-28 3361 while ((p = strsep(&input, " \t\n"))) {
7caa47151ab2e64 Tejun Heo 2019-08-28 3362 substring_t args[MAX_OPT_ARGS];
7caa47151ab2e64 Tejun Heo 2019-08-28 3363 char buf[32];
7caa47151ab2e64 Tejun Heo 2019-08-28 3364 int tok;
7caa47151ab2e64 Tejun Heo 2019-08-28 3365 u64 v;
7caa47151ab2e64 Tejun Heo 2019-08-28 3366
7caa47151ab2e64 Tejun Heo 2019-08-28 3367 if (!*p)
7caa47151ab2e64 Tejun Heo 2019-08-28 3368 continue;
7caa47151ab2e64 Tejun Heo 2019-08-28 3369
7caa47151ab2e64 Tejun Heo 2019-08-28 3370 switch (match_token(p, cost_ctrl_tokens, args)) {
7caa47151ab2e64 Tejun Heo 2019-08-28 3371 case COST_CTRL:
7caa47151ab2e64 Tejun Heo 2019-08-28 3372 match_strlcpy(buf, &args[0], sizeof(buf));
7caa47151ab2e64 Tejun Heo 2019-08-28 3373 if (!strcmp(buf, "auto"))
7caa47151ab2e64 Tejun Heo 2019-08-28 3374 user = false;
7caa47151ab2e64 Tejun Heo 2019-08-28 3375 else if (!strcmp(buf, "user"))
7caa47151ab2e64 Tejun Heo 2019-08-28 3376 user = true;
7caa47151ab2e64 Tejun Heo 2019-08-28 3377 else
7caa47151ab2e64 Tejun Heo 2019-08-28 3378 goto einval;
7caa47151ab2e64 Tejun Heo 2019-08-28 3379 continue;
7caa47151ab2e64 Tejun Heo 2019-08-28 3380 case COST_MODEL:
7caa47151ab2e64 Tejun Heo 2019-08-28 3381 match_strlcpy(buf, &args[0], sizeof(buf));
7caa47151ab2e64 Tejun Heo 2019-08-28 3382 if (strcmp(buf, "linear"))
7caa47151ab2e64 Tejun Heo 2019-08-28 3383 goto einval;
7caa47151ab2e64 Tejun Heo 2019-08-28 3384 continue;
7caa47151ab2e64 Tejun Heo 2019-08-28 3385 }
7caa47151ab2e64 Tejun Heo 2019-08-28 3386
7caa47151ab2e64 Tejun Heo 2019-08-28 3387 tok = match_token(p, i_lcoef_tokens, args);
7caa47151ab2e64 Tejun Heo 2019-08-28 3388 if (tok == NR_I_LCOEFS)
7caa47151ab2e64 Tejun Heo 2019-08-28 3389 goto einval;
7caa47151ab2e64 Tejun Heo 2019-08-28 3390 if (match_u64(&args[0], &v))
7caa47151ab2e64 Tejun Heo 2019-08-28 3391 goto einval;
7caa47151ab2e64 Tejun Heo 2019-08-28 3392 u[tok] = v;
7caa47151ab2e64 Tejun Heo 2019-08-28 3393 user = true;
7caa47151ab2e64 Tejun Heo 2019-08-28 3394 }
7caa47151ab2e64 Tejun Heo 2019-08-28 3395
7caa47151ab2e64 Tejun Heo 2019-08-28 3396 spin_lock_irq(&ioc->lock);
7caa47151ab2e64 Tejun Heo 2019-08-28 3397 if (user) {
7caa47151ab2e64 Tejun Heo 2019-08-28 3398 memcpy(ioc->params.i_lcoefs, u, sizeof(u));
7caa47151ab2e64 Tejun Heo 2019-08-28 3399 ioc->user_cost_model = true;
7caa47151ab2e64 Tejun Heo 2019-08-28 3400 } else {
7caa47151ab2e64 Tejun Heo 2019-08-28 3401 ioc->user_cost_model = false;
7caa47151ab2e64 Tejun Heo 2019-08-28 3402 }
7caa47151ab2e64 Tejun Heo 2019-08-28 3403 ioc_refresh_params(ioc, true);
7caa47151ab2e64 Tejun Heo 2019-08-28 3404 spin_unlock_irq(&ioc->lock);
7caa47151ab2e64 Tejun Heo 2019-08-28 3405
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3406 rq_qos_put(rqos);
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3407 blkdev_put_no_open(bdev);
7caa47151ab2e64 Tejun Heo 2019-08-28 3408 return nbytes;
7caa47151ab2e64 Tejun Heo 2019-08-28 3409
7caa47151ab2e64 Tejun Heo 2019-08-28 3410 einval:
7caa47151ab2e64 Tejun Heo 2019-08-28 3411 ret = -EINVAL;
8bef9fba59d8d47 Wang Jianchao 2022-01-10 3412 rq_qos_put(rqos);
7caa47151ab2e64 Tejun Heo 2019-08-28 3413 err:
22ae8ce8b89241c Christoph Hellwig 2020-11-26 3414 blkdev_put_no_open(bdev);
7caa47151ab2e64 Tejun Heo 2019-08-28 3415 return ret;
7caa47151ab2e64 Tejun Heo 2019-08-28 3416 }
7caa47151ab2e64 Tejun Heo 2019-08-28 3417

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2022-01-13 02:40:45

by Wang Jianchao

[permalink] [raw]
Subject: Re: [PATCH 13/13] blk: introduce iostat per cgroup module



On 2022/1/13 4:13 AM, Tejun Heo wrote:
> On Mon, Jan 10, 2022 at 05:10:46PM +0800, Wang Jianchao wrote:
>> From: Wang Jianchao <[email protected]>
>>
>> iostat can only track the whole device's io statistics. This patch
>> introduces iostat per cgroup based on blk-rq-qos framework which
>> can track bw, iops, queue latency and device latency and distinguish
>> regular or meta data. The per-cgroup blkio.iostat is output in the
>> following format, where the field group repeats for read, write and
>> discard:
>> vda-data bytes iops queue_lat dev_lat [ditto] [ditto]
>> vda-meta bytes iops queue_lat dev_lat [ditto] [ditto]
>> In particular, the blkio.iostat of the root cgroup only outputs the
>> statistics of IOs from the root cgroup itself. However, a non-root
>> blkio.iostat includes all of its children cgroups. With meta stats in
>> the root cgroup, we hope to observe the performance of fs metadata.
> I think using bpf is a way better solution for this kind of detailed
> statistics.
bw/iops/lat of data or metadata of one cgroup are very basic statistics
which the kernel could provide, especially when cgroups are employed everywhere.
And we want to collect them all the time while the instance in the cgroup is
running.
> What if I want to know what portions are random, or the
> distribution of IO sizes?
This looks like really detailed statistics :)
> Do I add another rq-qos policy or add another
> interface file with interface versioning?
This iostat module cannot provide all the kinds of statistics we may want,
just some very basic things. And maybe it can provide better hooks
for installing eBPF programs to collect detailed statistics.

Best regards
Jianchao

2022-01-13 03:52:44

by Wang Jianchao

[permalink] [raw]
Subject: Re: [PATCH 01/13] blk: make blk-rq-qos support pluggable and modular policy



On 2022/1/13 9:49 上午, kernel test robot wrote:
> ll warnings (new ones prefixed by >>):
>
> block/blk-iocost.c:1244:6: warning: variable 'last_period' set but not used [-Wunused-but-set-variable]
> u64 last_period, cur_period;
> ^
>>> block/blk-iocost.c:3348:7: warning: variable 'ioc' is uninitialized when used here [-Wuninitialized]
> if (!ioc) {

Thanks so much.
I will fix this in the next patch version.

Jianchao
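
For reference, the warning flags a test of 'ioc' before anything assigns it;
the check presumably belongs on the rqos lookup instead. One likely shape of
the fix, as a sketch rather than the posted v2:

        rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
        if (!rqos) {    /* was "if (!ioc)"; ioc is still uninitialized here */
                ret = blk_iocost_init(bdev_get_queue(bdev));
                if (ret)
                        goto err;
                rqos = rq_qos_get(bdev_get_queue(bdev), RQ_QOS_COST);
        }
        ioc = rqos_to_ioc(rqos);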

2022-01-13 17:01:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 13/13] blk: introduce iostat per cgroup module

Hello,

On Thu, Jan 13, 2022 at 10:40:27AM +0800, Wang Jianchao wrote:
> bw/iops/lat of data or metadata of one cgroup are very basic statistics

bw and iops can already be derived from the cumulative counters in io.stat.
Latencies without distribution are often what we traditionally exposed, not
because they're great but because exposing distributions is so cumbersome
from the kernel. Also, latency numbers at the cgroup level aren't *that* useful
anyway. There isn't a lot you can deduce for the particular cgroup from that
number.

> which kernel could provide especially when cgroup is employed everywhere.
> And we love to collect them all the time during the instance in cgroup is
> running.

It's really not that difficult to collect these numbers from bpf with pretty
low overhead. There's the problem of deployment for which there isn't "the"
right and convenient way but it'd be more worthwhile to put efforts towards
that.

> > What if I want to know what portions are random, or the
> > distribution of IO sizes?
> This looks really detailed statistics :)
> > Do I add another rq-qos policy or add another
> > interface file with interface versioning?
> This iostat module can not provide all the kinds of statistics we want
> but just some very basic things. And maybe it can provide better hooks
> to install the ebpf program to collect detailed statistics.

I mean, "really basic" means different things for different folks. Bytes /
ios, we all seem to agree. Beyond that, who knows? There are already enough
hooks to collect the stats you're trying to collect. The examples in py-bcc
might be a good place to start.

Thanks.

--
tejun

2022-01-14 02:01:38

by Wang Jianchao

[permalink] [raw]
Subject: Re: [PATCH 13/13] blk: introduce iostat per cgroup module

Hi Tejun

Thanks so much for your comment :)
I really appreciate it.

On 2022/1/14 1:01 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 13, 2022 at 10:40:27AM +0800, Wang Jianchao wrote:
>> bw/iops/lat of data or metadata of one cgroup are very basic statistics
>
> bw and iops can already be derived from the cumulative counters in io.stat.
> Latencies without distribution are often what we traditionally exposed, not
> because they're great but because exposing distributions is so cumbersome
> from the kernel. Also, latency numbers at the cgroup level aren't *that* useful
> anyway. There isn't a lot you can deduce for the particular cgroup from that
> number.
I agree with this. But my customers would yell at me because they really want
to see that...
>
>> which the kernel could provide, especially when cgroups are employed everywhere.
>> And we want to collect them all the time while the instance in the cgroup is
>> running.
>
> It's really not that difficult to collect these numbers from bpf with pretty
> low overhead. There's the problem of deployment for which there isn't "the"
> right and convenient way but it'd be more worthwhile to put efforts towards
> that.
>
>>> What if I want to know what portions are random, or the
>>> distribution of IO sizes?
>> This looks like really detailed statistics :)
>>> Do I add another rq-qos policy or add another
>>> interface file with interface versioning?
>> This iostat module cannot provide all the kinds of statistics we may want,
>> just some very basic things. And maybe it can provide better hooks
>> for installing eBPF programs to collect detailed statistics.
>
> I mean, "really basic" means different things for different folks. Bytes /
> ios, we all seem to agree. Beyond that, who knows? There are already enough
> hooks to collect the stats you're trying to collect. The examples in py-bcc
> might be a good place to start.
Thanks for your suggestion. I will read it and try with bpf.

Best Regards
Jianchao