2018-06-05 13:35:00

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 00/13][V2] Introduce io.latency io controller for cgroups

v1->v2:
- fix how we get the swap device for the page when doing the swap throttling.
- add a bunch of comments how the throttling works.
- move the documentation to cgroup-v2.txt
- address the various other comments.

==== Original message =====

This series adds a latency based io controller for cgroups. It is based on the
same concept as the writeback throttling code, which is watching the overall
total latency of IO's in a given window and then adjusting the queue depth of
the group accordingly. This is meant to be a workload protection controller, so
whoever has the lowest latency target gets the preferential treatment with no
thought to fairness or proportionality. It is meant to be work conserving, so
as long as nobody is missing their latency targets the disk is fair game.

We have been testing this in production for several months now to get the
behavior right and we are finally at the point that it is working well in all of
our test cases. With this patch we protect our main workload (the web server)
and isolate out the system services (chef/yum/etc). This works well in the
normal case, smoothing out weird request per second (RPS) dips that we would see
when one of the system services would run and compete for IO resources. This
also works incredibly well in the runaway task case.

The runaway task usecase is where we have some task that slowly eats up all of
the memory on the system (think a memory leak). Previously this sort of
workload would push the box into a swapping/oom death spiral that was only
recovered by rebooting the box. With this patchset and proper configuration of
the memory.low and io.latency controllers we're able to survive this test with a
at most 20% dip in RPS.

There are a lot of extra patches in here to set everything up. The following
are just infrastructure that should be relatively uncontroversial

[PATCH 01/13] block: add bi_blkg to the bio for cgroups
[PATCH 02/13] block: introduce bio_issue_as_root_blkg
[PATCH 03/13] blk-cgroup: allow controllers to output their own stats

The following simply allow us to tag swap IO and assign the appropriate cgroup
to the bio's so we can do the appropriate accounting inside the io controller

[PATCH 04/13] blk: introduce REQ_SWAP
[PATCH 05/13] swap,blkcg: issue swap io with the appropriate context

This is so that we can induce delays. The io controller mostly throttles based
on queue depth, however for cases like REQ_SWAP/REQ_META where we cannot
throttle without inducing a priority inversion we have a mechanism to "back
charge" groups for this IO by inducing an artificial delay at user space return
time.

[PATCH 06/13] blkcg: add generic throttling mechanism
[PATCH 07/13] memcontrol: schedule throttling if we are congested

This is more moving things around and refactoring, Jens you may want to pay
close attention to this to make sure I didn't break anything.

[PATCH 08/13] blk-stat: export helpers for modifying blk_rq_stat
[PATCH 09/13] blk-rq-qos: refactor out common elements of blk-wbt
[PATCH 10/13] block: remove external dependency on wbt_flags
[PATCH 11/13] rq-qos: introduce dio_bio callback

And this is the meat of the controller and it's documentation.

[PATCH 12/13] block: introduce blk-iolatency io controller
[PATCH 13/13] Documentation: add a doc for blk-iolatency

Jens, I'm sending this through your tree since it's mostly block related,
however there are the two mm related patches, so if somebody from mm could weigh
in on how we want to handle those that would be great. Thanks,

Josef


2018-06-05 13:30:52

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 05/13] swap,blkcg: issue swap io with the appropriate context

From: Tejun Heo <[email protected]>

For backcharging we need to know who the page belongs to when swapping
it out.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
---
block/bio.c | 24 ++++++++++++++++++++++++
include/linux/bio.h | 7 +++++++
mm/page_io.c | 1 +
3 files changed, 32 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 57c4b1986e76..c77fe1b4caa8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2005,6 +2005,30 @@ EXPORT_SYMBOL(bioset_create);

#ifdef CONFIG_BLK_CGROUP

+#ifdef CONFIG_MEMCG
+/**
+ * bio_associate_blkcg_from_page - associate a bio with the page's blkcg
+ * @bio: target bio
+ * @page: the page to lookup the blkcg from
+ *
+ * Associate @bio with the blkcg from @page's owning memcg. This works like
+ * every other associate function wrt references.
+ */
+int bio_associate_blkcg_from_page(struct bio *bio, struct page *page)
+{
+ struct cgroup_subsys_state *blkcg_css;
+
+ if (unlikely(bio->bi_css))
+ return -EBUSY;
+ if (!page->mem_cgroup)
+ return 0;
+ blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
+ &io_cgrp_subsys);
+ bio->bi_css = blkcg_css;
+ return 0;
+}
+#endif /* CONFIG_MEMCG */
+
/**
* bio_associate_blkcg - associate a bio with the specified blkcg
* @bio: target bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f2f3f1428e81..eef1e8877048 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -545,6 +545,13 @@ do { \
#define bio_dev(bio) \
disk_devt((bio)->bi_disk)

+#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
+int bio_associate_blkcg_from_page(struct bio *bio, struct page *page);
+#else
+static inline int bio_associate_blkcg_from_page(struct bio *bio,
+ struct page *page) { return 0; }
+#endif
+
#ifdef CONFIG_BLK_CGROUP
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg);
diff --git a/mm/page_io.c b/mm/page_io.c
index a552cb37e220..aafd19ec1db4 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -339,6 +339,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
goto out;
}
bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
+ bio_associate_blkcg_from_page(bio, page);
count_swpout_vm_event(page);
set_page_writeback(page);
unlock_page(page);
--
2.14.3


2018-06-05 13:30:55

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 08/13] blk-stat: export helpers for modifying blk_rq_stat

From: Josef Bacik <[email protected]>

We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.

Signed-off-by: Josef Bacik <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/blk-stat.c | 16 ++++++++--------
block/blk-stat.h | 4 ++++
2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 175c143ac5b9..7587b1c3caaf 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -17,7 +17,7 @@ struct blk_queue_stats {
bool enable_accounting;
};

-static void blk_stat_init(struct blk_rq_stat *stat)
+void blk_rq_stat_init(struct blk_rq_stat *stat)
{
stat->min = -1ULL;
stat->max = stat->nr_samples = stat->mean = 0;
@@ -25,7 +25,7 @@ static void blk_stat_init(struct blk_rq_stat *stat)
}

/* src is a per-cpu stat, mean isn't initialized */
-static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
{
if (!src->nr_samples)
return;
@@ -39,7 +39,7 @@ static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
dst->nr_samples += src->nr_samples;
}

-static void __blk_stat_add(struct blk_rq_stat *stat, u64 value)
+void blk_rq_stat_add(struct blk_rq_stat *stat, u64 value)
{
stat->min = min(stat->min, value);
stat->max = max(stat->max, value);
@@ -69,7 +69,7 @@ void blk_stat_add(struct request *rq, u64 now)
continue;

stat = &get_cpu_ptr(cb->cpu_stat)[bucket];
- __blk_stat_add(stat, value);
+ blk_rq_stat_add(stat, value);
put_cpu_ptr(cb->cpu_stat);
}
rcu_read_unlock();
@@ -82,15 +82,15 @@ static void blk_stat_timer_fn(struct timer_list *t)
int cpu;

for (bucket = 0; bucket < cb->buckets; bucket++)
- blk_stat_init(&cb->stat[bucket]);
+ blk_rq_stat_init(&cb->stat[bucket]);

for_each_online_cpu(cpu) {
struct blk_rq_stat *cpu_stat;

cpu_stat = per_cpu_ptr(cb->cpu_stat, cpu);
for (bucket = 0; bucket < cb->buckets; bucket++) {
- blk_stat_sum(&cb->stat[bucket], &cpu_stat[bucket]);
- blk_stat_init(&cpu_stat[bucket]);
+ blk_rq_stat_sum(&cb->stat[bucket], &cpu_stat[bucket]);
+ blk_rq_stat_init(&cpu_stat[bucket]);
}
}

@@ -143,7 +143,7 @@ void blk_stat_add_callback(struct request_queue *q,

cpu_stat = per_cpu_ptr(cb->cpu_stat, cpu);
for (bucket = 0; bucket < cb->buckets; bucket++)
- blk_stat_init(&cpu_stat[bucket]);
+ blk_rq_stat_init(&cpu_stat[bucket]);
}

spin_lock(&q->stats->lock);
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 78399cdde9c9..f4a1568e81a4 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -159,4 +159,8 @@ static inline void blk_stat_activate_msecs(struct blk_stat_callback *cb,
mod_timer(&cb->timer, jiffies + msecs_to_jiffies(msecs));
}

+void blk_rq_stat_add(struct blk_rq_stat *, u64);
+void blk_rq_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
+void blk_rq_stat_init(struct blk_rq_stat *);
+
#endif
--
2.14.3


2018-06-05 13:31:25

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 12/13] block: introduce blk-iolatency io controller

From: Josef Bacik <[email protected]>

Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt

- It's bio based, so the latency covers the whole block layer in addition to
the actual io.
- We will throttle all IO types that comes in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
be particularly fast, which could give us a false sense of the impact of
other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
we can have as many outstanding bio's as we're allowed to. Only at
throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
delays in order to deal with cases like metadata only or swap heavy
workloads.

In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.

Signed-off-by: Josef Bacik <[email protected]>
---
block/Kconfig | 12 +
block/Makefile | 1 +
block/blk-iolatency.c | 840 ++++++++++++++++++++++++++++++++++++++++++++++
block/blk-sysfs.c | 2 +
block/blk.h | 6 +
include/linux/blk_types.h | 2 -
6 files changed, 861 insertions(+), 2 deletions(-)
create mode 100644 block/blk-iolatency.c

diff --git a/block/Kconfig b/block/Kconfig
index 28ec55752b68..c3205b2003a0 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -149,6 +149,18 @@ config BLK_WBT
dynamically on an algorithm loosely based on CoDel, factoring in
the realtime performance of the disk.

+config BLK_CGROUP_IOLATENCY
+ bool "Enable support for latency based cgroup IO protection"
+ depends on BLK_CGROUP=y
+ default n
+ ---help---
+ Enabling this option enables the .latency interface for IO throttling.
+ The IO controller will attempt to maintain average io latencies below
+ the configured latency target, throttling anybody with a higher latency
+ target than the victimized group.
+
+ Note, this is an experimental interface and could be changed someday.
+
config BLK_WBT_SQ
bool "Single queue writeback throttling"
default n
diff --git a/block/Makefile b/block/Makefile
index 981605042cad..645fdabe6a8e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
+obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
new file mode 100644
index 000000000000..75b7ef65ddb6
--- /dev/null
+++ b/block/blk-iolatency.c
@@ -0,0 +1,840 @@
+/*
+ * Block rq-qos base io controller
+ *
+ * This works similar to wbt with a few exceptions
+ *
+ * - It's bio based, so the latency covers the whole block layer in addition to
+ * the actual io.
+ * - We will throttle all IO that comes in here if we need to.
+ * - We use the mean latency over the 100ms window. This is because writes can
+ * be particularly fast, which could give us a false sense of the impact of
+ * other workloads on our protected workload.
+ * - By default there's no throttling, we set the queue_depth to INT_MAX so that
+ * we can have as many outstanding bio's as we're allowed to. Only at
+ * throttle time do we pay attention to the actual queue depth.
+ *
+ * The hierarchy works like the cpu controller does, we track the latency at
+ * every configured node, and each configured node has it's own independent
+ * queue depth. This means that we only care about our latency targets at the
+ * peer level. Some group at the bottom of the hierarchy isn't going to affect
+ * a group at the end of some other path if we're only configred at leaf level.
+ *
+ * Consider the following
+ *
+ * root blkg
+ * / \
+ * fast (target=5ms) slow (target=10ms)
+ * / \ / \
+ * a b normal(15ms) unloved
+ *
+ * "a" and "b" have no target, but their combined io under "fast" cannot exceed
+ * an average latency of 5ms. If it does then we will throttle the "slow"
+ * group. In the case of "normal", if it exceeds its 15ms target, we will
+ * throttle "unloved", but nobody else.
+ *
+ * In this example "fast", "slow", and "normal" will be the only groups actually
+ * accounting their io latencies. We have to walk up the heirarchy to the root
+ * on every submit and complete so we can do the appropriate stat recording and
+ * adjust the queue depth of ourselves if needed.
+ *
+ * There are 2 ways we throttle IO.
+ *
+ * 1) Queue depth throttling. As we throttle down we will adjust the maximum
+ * number of IO's we're allowed to have in flight. This starts at (u64)-1 down
+ * to 1. If the group is only ever submitting IO for itself then this is the
+ * only way we throttle.
+ *
+ * 2) Induced delay throttling. This is for the case that a group is generating
+ * IO that has to be issued by the root cg to avoid priority inversion. So think
+ * REQ_META or REQ_SWAP. If we are already at qd == 1 and we're getting a lot
+ * of work done for us on behalf of the root cg and are being asked to scale
+ * down more then we induce a latency at userspace return. We accumulate the
+ * total amount of time we need to be punished by doing
+ *
+ * total_time += min_lat_nsec - actual_io_completion
+ *
+ * and then at throttle time will do
+ *
+ * throttle_time = min(total_time, NSEC_PER_SEC)
+ *
+ * This induced delay will throttle back the activity that is generating the
+ * root cg issued io's, wethere that's some metadata intensive operation or the
+ * group is using so much memory that it is pushing us into swap.
+ *
+ * Copyright (C) 2018 Josef Bacik
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/backing-dev.h>
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/memcontrol.h>
+#include <trace/events/block.h>
+#include "blk-wbt.h"
+
+#define DEFAULT_SCALE_COOKIE 1000000U
+
+static struct blkcg_policy blkcg_policy_iolatency;
+struct iolatency_grp;
+
+struct blk_iolatency {
+ struct rq_qos rqos;
+ struct timer_list timer;
+ atomic_t enabled;
+};
+
+static inline struct blk_iolatency *BLKIOLATENCY(struct rq_qos *rqos)
+{
+ return container_of(rqos, struct blk_iolatency, rqos);
+}
+
+static inline bool blk_iolatency_enabled(struct blk_iolatency *blkiolat)
+{
+ return atomic_read(&blkiolat->enabled) > 0;
+}
+
+struct child_latency_info {
+ spinlock_t lock;
+
+ /* Last time we adjusted the scale of everybody. */
+ u64 last_scale_event;
+
+ /* The latency that we missed. */
+ u64 scale_lat;
+
+ /* The guy who actually changed the latency numbers. */
+ struct iolatency_grp *scale_grp;
+
+ /* Cookie to tell if we need to scale up or down. */
+ atomic_t scale_cookie;
+};
+
+struct iolatency_grp {
+ struct blkg_policy_data pd;
+ struct blk_rq_stat __percpu *stats;
+ struct blk_iolatency *blkiolat;
+ struct rq_depth rq_depth;
+ struct rq_wait rq_wait;
+ atomic64_t window_start;
+ atomic_t scale_cookie;
+ u64 min_lat_nsec;
+ u64 cur_win_nsec;
+
+ /* total running average of our io latency. */
+ u64 total_lat_avg;
+ u64 total_lat_nr;
+
+ struct child_latency_info child_lat;
+};
+
+static inline struct iolatency_grp *pd_to_lat(struct blkg_policy_data *pd)
+{
+ return container_of(pd, struct iolatency_grp, pd);
+}
+
+static inline struct iolatency_grp *blkg_to_lat(struct blkcg_gq *blkg)
+{
+ return pd_to_lat(blkg_to_pd(blkg, &blkcg_policy_iolatency));
+}
+
+static inline struct blkcg_gq *lat_to_blkg(struct iolatency_grp *iolat)
+{
+ return pd_to_blkg(&iolat->pd);
+}
+
+static inline bool iolatency_may_queue(struct iolatency_grp *iolat,
+ wait_queue_entry_t *wait)
+{
+ struct rq_wait *rqw = &iolat->rq_wait;
+
+ if (waitqueue_active(&rqw->wait) &&
+ rqw->wait.head.next != &wait->entry)
+ return false;
+ return rq_wait_inc_below(rqw, iolat->rq_depth.max_depth);
+}
+
+static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
+ struct iolatency_grp *iolat,
+ spinlock_t *lock, bool issue_as_root,
+ bool use_memdelay)
+ __releases(lock)
+ __acquires(lock)
+{
+ struct rq_wait *rqw = &iolat->rq_wait;
+ unsigned use_delay = atomic_read(&lat_to_blkg(iolat)->use_delay);
+ DEFINE_WAIT(wait);
+
+ if (use_delay)
+ blkcg_schedule_throttle(rqos->q, use_memdelay);
+
+ /* We don't wait for a qd slot, we just take what we want. */
+ if (issue_as_root) {
+ atomic_inc(&rqw->inflight);
+ return;
+ }
+
+ if (iolatency_may_queue(iolat, &wait))
+ return;
+
+ do {
+ prepare_to_wait_exclusive(&rqw->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+ if (iolatency_may_queue(iolat, &wait))
+ break;
+
+ if (lock) {
+ spin_unlock_irq(lock);
+ io_schedule();
+ spin_lock_irq(lock);
+ } else {
+ io_schedule();
+ }
+ } while (1);
+
+ finish_wait(&rqw->wait, &wait);
+}
+
+#define SCALE_DOWN_FACTOR 2
+#define SCALE_UP_FACTOR 4
+
+static inline unsigned long scale_amount(unsigned long qd, bool up)
+{
+ return max(up ? qd >> SCALE_UP_FACTOR : qd >> SCALE_DOWN_FACTOR, 1UL);
+}
+
+/*
+ * We scale the qd down faster than we scale up, so we need to use this helper
+ * to adjust the scale_cookie accordingly so we don't prematurely get
+ * scale_cookie at DEFAULT_SCALE_COOKIE and unthrottle too much.
+ */
+static void scale_cookie_change(struct blk_iolatency *blkiolat,
+ struct child_latency_info *lat_info,
+ bool up)
+{
+ unsigned long qd = blk_queue_depth(blkiolat->rqos.q);
+ unsigned long scale = scale_amount(qd, up);
+ unsigned long old = atomic_read(&lat_info->scale_cookie);
+
+ if (up) {
+ if (scale + old > DEFAULT_SCALE_COOKIE)
+ atomic_set(&lat_info->scale_cookie,
+ DEFAULT_SCALE_COOKIE);
+ else if (DEFAULT_SCALE_COOKIE - old > qd)
+ atomic_inc(&lat_info->scale_cookie);
+ else
+ atomic_add(scale, &lat_info->scale_cookie);
+ } else {
+ if (DEFAULT_SCALE_COOKIE - old > qd)
+ atomic_dec(&lat_info->scale_cookie);
+ else
+ atomic_sub(scale, &lat_info->scale_cookie);
+ }
+}
+
+/*
+ * Change the queue depth of the iolatency_grp. We add/subtract 1/16th of the
+ * queue depth at a time so we don't get wild swings and hopefully dial in to
+ * fairer distribution of the overall queue depth.
+ */
+static void scale_change(struct iolatency_grp *iolat, bool up)
+{
+ unsigned long qd = blk_queue_depth(iolat->blkiolat->rqos.q);
+ unsigned long scale = scale_amount(qd, up);
+ unsigned long old = iolat->rq_depth.max_depth;
+ bool changed = false;
+
+ if (old > qd)
+ old = qd;
+
+ if (up) {
+ if (old == 1 && blkcg_unuse_delay(lat_to_blkg(iolat)))
+ return;
+
+ if (old < qd) {
+ changed = true;
+ old += scale;
+ old = min(old, qd);
+ iolat->rq_depth.max_depth = old;
+ wake_up_all(&iolat->rq_wait.wait);
+ }
+ } else if (old > 1) {
+ old >>= 1;
+ changed = true;
+ iolat->rq_depth.max_depth = max(old, 1UL);
+ }
+}
+
+/* Check our parent and see if the scale cookie has changed. */
+static void check_scale_change(struct iolatency_grp *iolat)
+{
+ struct iolatency_grp *parent;
+ struct child_latency_info *lat_info;
+ unsigned int cur_cookie;
+ unsigned int our_cookie = atomic_read(&iolat->scale_cookie);
+ u64 scale_lat;
+ unsigned int old;
+ int direction = 0;
+
+ if (lat_to_blkg(iolat)->parent == NULL)
+ return;
+
+ parent = blkg_to_lat(lat_to_blkg(iolat)->parent);
+ lat_info = &parent->child_lat;
+ cur_cookie = atomic_read(&lat_info->scale_cookie);
+ scale_lat = READ_ONCE(lat_info->scale_lat);
+
+ if (cur_cookie < our_cookie)
+ direction = -1;
+ else if (cur_cookie > our_cookie)
+ direction = 1;
+ else
+ return;
+
+ old = atomic_cmpxchg(&iolat->scale_cookie, our_cookie, cur_cookie);
+
+ /* Somebody beat us to the punch, just bail. */
+ if (old != our_cookie)
+ return;
+
+ if (direction < 0 && (!scale_lat || iolat->min_lat_nsec <= scale_lat))
+ return;
+
+ /* We're as low as we can go. */
+ if (iolat->rq_depth.max_depth == 1 && direction < 0) {
+ blkcg_use_delay(lat_to_blkg(iolat));
+ return;
+ }
+
+ /* We're back to the default cookie, unthrottle all the things. */
+ if (cur_cookie == DEFAULT_SCALE_COOKIE) {
+ blkcg_clear_delay(lat_to_blkg(iolat));
+ iolat->rq_depth.max_depth = INT_MAX;
+ wake_up_all(&iolat->rq_wait.wait);
+ return;
+ }
+
+ scale_change(iolat, direction > 0);
+}
+
+static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio,
+ spinlock_t *lock)
+{
+ struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
+ struct blkcg *blkcg;
+ struct blkcg_gq *blkg;
+ struct request_queue *q = rqos->q;
+ bool issue_as_root = bio_issue_as_root_blkg(bio);
+
+ if (!blk_iolatency_enabled(blkiolat))
+ return;
+
+ rcu_read_lock();
+ blkcg = bio_blkcg(bio);
+ bio_associate_blkcg(bio, &blkcg->css);
+ blkg = blkg_lookup(blkcg, q);
+ if (unlikely(!blkg)) {
+ if (!lock)
+ spin_lock_irq(q->queue_lock);
+ blkg = blkg_lookup_create(blkcg, q);
+ if (IS_ERR(blkg))
+ blkg = NULL;
+ if (!lock)
+ spin_unlock_irq(q->queue_lock);
+ }
+ if (!blkg)
+ goto out;
+
+ bio_issue_init(&bio->bi_issue, bio_sectors(bio));
+ bio_associate_blkg(bio, blkg);
+out:
+ rcu_read_unlock();
+ while (blkg && blkg->parent) {
+ struct iolatency_grp *iolat = blkg_to_lat(blkg);
+
+ check_scale_change(iolat);
+ __blkcg_iolatency_throttle(rqos, iolat, lock, issue_as_root,
+ (bio->bi_opf & REQ_SWAP) == REQ_SWAP);
+ blkg = blkg->parent;
+ }
+ if (!timer_pending(&blkiolat->timer))
+ mod_timer(&blkiolat->timer, jiffies + HZ);
+}
+
+static void iolatency_record_time(struct iolatency_grp *iolat,
+ struct bio_issue *issue, u64 now,
+ bool issue_as_root)
+{
+ struct blk_rq_stat *rq_stat;
+ u64 start = bio_issue_time(issue);
+
+ if (now <= start)
+ return;
+
+ /*
+ * We don't want to count issue_as_root bio's in the cgroups latency
+ * statistics as it could skew the numbers downwards.
+ */
+ if (unlikely(issue_as_root && iolat->rq_depth.max_depth != (u64)-1)) {
+ u64 sub = iolat->min_lat_nsec;
+ now -= start;
+ if (now < sub)
+ blkcg_add_delay(lat_to_blkg(iolat), now, sub - now);
+ return;
+ }
+
+ rq_stat = get_cpu_ptr(iolat->stats);
+ blk_rq_stat_add(rq_stat, now - start);
+ put_cpu_ptr(rq_stat);
+}
+
+#define BLKIOLATENCY_MIN_ADJUST_TIME (500 * NSEC_PER_MSEC)
+#define BLKIOLATENCY_MIN_GOOD_SAMPLES 5
+
+static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now)
+{
+ struct blkcg_gq *blkg = lat_to_blkg(iolat);
+ struct iolatency_grp *parent;
+ struct child_latency_info *lat_info;
+ struct blk_rq_stat stat;
+ unsigned long flags;
+ int cpu;
+
+ blk_rq_stat_init(&stat);
+ preempt_disable();
+ for_each_online_cpu(cpu) {
+ struct blk_rq_stat *s;
+ s = per_cpu_ptr(iolat->stats, cpu);
+ blk_rq_stat_sum(&stat, s);
+ blk_rq_stat_init(s);
+ }
+ preempt_enable();
+
+ /*
+ * Our average exceeded our window, scale up our window so we are more
+ * accurate, but not more than the global timer.
+ */
+ if (stat.mean > iolat->cur_win_nsec) {
+ iolat->cur_win_nsec <<= 1;
+ iolat->cur_win_nsec =
+ max_t(u64, iolat->cur_win_nsec, NSEC_PER_SEC);
+ }
+
+ if (!blkg->parent)
+ return;
+
+ parent = blkg_to_lat(blkg->parent);
+ lat_info = &parent->child_lat;
+
+ iolat->total_lat_avg =
+ div64_u64((iolat->total_lat_avg * iolat->total_lat_nr) +
+ stat.mean, iolat->total_lat_nr + 1);
+
+ iolat->total_lat_nr++;
+
+ /* Everything is ok and we don't need to adjust the scale. */
+ if (stat.mean <= iolat->min_lat_nsec &&
+ atomic_read(&lat_info->scale_cookie) == DEFAULT_SCALE_COOKIE)
+ return;
+
+ /* Somebody beat us to the punch, just bail. */
+ spin_lock_irqsave(&lat_info->lock, flags);
+ if ((lat_info->last_scale_event >= now ||
+ now - lat_info->last_scale_event < BLKIOLATENCY_MIN_ADJUST_TIME) &&
+ lat_info->scale_lat <= iolat->min_lat_nsec)
+ goto out;
+
+ if (stat.mean <= iolat->min_lat_nsec &&
+ stat.nr_samples >= BLKIOLATENCY_MIN_GOOD_SAMPLES) {
+ lat_info->last_scale_event = now;
+ if (lat_info->scale_grp == iolat)
+ scale_cookie_change(iolat->blkiolat, lat_info, true);
+ } else if (stat.mean > iolat->min_lat_nsec) {
+ lat_info->last_scale_event = now;
+ if (!lat_info->scale_grp ||
+ lat_info->scale_lat > iolat->min_lat_nsec) {
+ WRITE_ONCE(lat_info->scale_lat, iolat->min_lat_nsec);
+ lat_info->scale_grp = iolat;
+ }
+ scale_cookie_change(iolat->blkiolat, lat_info, false);
+ }
+out:
+ spin_unlock_irqrestore(&lat_info->lock, flags);
+}
+
+static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
+{
+ struct blkcg_gq *blkg;
+ struct rq_wait *rqw;
+ struct iolatency_grp *iolat;
+ u64 window_start;
+ u64 now = ktime_to_ns(ktime_get());
+ bool issue_as_root = bio_issue_as_root_blkg(bio);
+ bool enabled = false;
+
+ blkg = bio->bi_blkg;
+ if (!blkg)
+ return;
+
+ iolat = blkg_to_lat(blkg);
+ enabled = blk_iolatency_enabled(iolat->blkiolat);
+
+ while (blkg->parent) {
+ iolat = blkg_to_lat(blkg);
+ rqw = &iolat->rq_wait;
+
+ atomic_dec(&rqw->inflight);
+ if (!enabled || iolat->min_lat_nsec == 0)
+ goto next;
+ iolatency_record_time(iolat, &bio->bi_issue, now,
+ issue_as_root);
+ window_start = atomic64_read(&iolat->window_start);
+ if (now > window_start &&
+ (now - window_start) >= iolat->cur_win_nsec) {
+ if (atomic64_cmpxchg(&iolat->window_start,
+ window_start, now) == window_start)
+ iolatency_check_latencies(iolat, now);
+ }
+next:
+ wake_up_all(&rqw->wait);
+ blkg = blkg->parent;
+ }
+}
+
+static void blkcg_iolatency_cleanup(struct rq_qos *rqos, struct bio *bio)
+{
+ struct blkcg_gq *blkg;
+
+ blkg = bio->bi_blkg;
+ while (blkg && blkg->parent) {
+ struct rq_wait *rqw;
+ struct iolatency_grp *iolat;
+
+ iolat = blkg_to_lat(blkg);
+ rqw = &iolat->rq_wait;
+
+ atomic_dec(&rqw->inflight);
+ wake_up_all(&rqw->wait);
+ blkg = blkg->parent;
+ }
+}
+
+static void blkcg_iolatency_exit(struct rq_qos *rqos)
+{
+ struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
+
+ del_timer_sync(&blkiolat->timer);
+ blkcg_deactivate_policy(rqos->q, &blkcg_policy_iolatency);
+ kfree(blkiolat);
+}
+
+static struct rq_qos_ops blkcg_iolatency_ops = {
+ .throttle = blkcg_iolatency_throttle,
+ .cleanup = blkcg_iolatency_cleanup,
+ .done_bio = blkcg_iolatency_done_bio,
+ .exit = blkcg_iolatency_exit,
+};
+
+static void blkiolatency_timer_fn(struct timer_list *t)
+{
+ struct blk_iolatency *blkiolat = from_timer(blkiolat, t, timer);
+ struct blkcg_gq *blkg;
+ struct cgroup_subsys_state *pos_css;
+ u64 now = ktime_to_ns(ktime_get());
+
+ rcu_read_lock();
+ blkg_for_each_descendant_pre(blkg, pos_css,
+ blkiolat->rqos.q->root_blkg) {
+ struct iolatency_grp *iolat = blkg_to_lat(blkg);
+ struct child_latency_info *lat_info = &iolat->child_lat;
+ unsigned long flags;
+ u64 cookie = atomic_read(&lat_info->scale_cookie);
+
+ if (cookie >= DEFAULT_SCALE_COOKIE)
+ continue;
+
+ spin_lock_irqsave(&lat_info->lock, flags);
+ if (lat_info->last_scale_event >= now)
+ goto next;
+
+ /*
+ * We scaled down but don't have a scale_grp, scale up and carry
+ * on.
+ */
+ if (lat_info->scale_grp == NULL) {
+ scale_cookie_change(iolat->blkiolat, lat_info, true);
+ goto next;
+ }
+
+ /*
+ * It's been 5 seconds since our last scale event, clear the
+ * scale grp in case the group that needed the scale down isn't
+ * doing any IO currently.
+ */
+ if (now - lat_info->last_scale_event >=
+ ((u64)NSEC_PER_SEC * 5))
+ lat_info->scale_grp = NULL;
+next:
+ spin_unlock_irqrestore(&lat_info->lock, flags);
+ }
+ rcu_read_unlock();
+}
+
+int blk_iolatency_init(struct request_queue *q)
+{
+ struct blk_iolatency *blkiolat;
+ struct rq_qos *rqos;
+ int ret;
+
+ blkiolat = kzalloc(sizeof(*blkiolat), GFP_KERNEL);
+ if (!blkiolat)
+ return -ENOMEM;
+
+ rqos = &blkiolat->rqos;
+ rqos->id = RQ_QOS_CGROUP;
+ rqos->ops = &blkcg_iolatency_ops;
+ rqos->q = q;
+
+ rq_qos_add(q, rqos);
+
+ ret = blkcg_activate_policy(q, &blkcg_policy_iolatency);
+ if (ret) {
+ kfree(blkiolat);
+ return ret;
+ }
+
+ timer_setup(&blkiolat->timer, blkiolatency_timer_fn,
+ (unsigned long)blkiolat);
+
+ return 0;
+}
+
+static void iolatency_set_min_lat_nsec(struct blkcg_gq *blkg, u64 val)
+{
+ struct iolatency_grp *iolat = blkg_to_lat(blkg);
+ struct blk_iolatency *blkiolat = iolat->blkiolat;
+ u64 oldval = iolat->min_lat_nsec;
+
+ iolat->min_lat_nsec = val;
+ iolat->cur_win_nsec = max_t(u64, val << 4, 100 * NSEC_PER_MSEC);
+ iolat->cur_win_nsec = min_t(u64, iolat->cur_win_nsec, NSEC_PER_SEC);
+
+ if (!oldval && val)
+ atomic_inc(&blkiolat->enabled);
+ if (oldval && !val)
+ atomic_dec(&blkiolat->enabled);
+}
+
+static void iolatency_clear_scaling(struct blkcg_gq *blkg)
+{
+ if (blkg->parent) {
+ struct iolatency_grp *iolat = blkg_to_lat(blkg->parent);
+ struct child_latency_info *lat_info = &iolat->child_lat;
+
+ spin_lock_irq(&lat_info->lock);
+ atomic_set(&lat_info->scale_cookie, DEFAULT_SCALE_COOKIE);
+ lat_info->last_scale_event = 0;
+ lat_info->scale_grp = NULL;
+ lat_info->scale_lat = 0;
+ spin_unlock_irq(&lat_info->lock);
+ }
+}
+
+static ssize_t iolatency_set_limit(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct blkcg *blkcg = css_to_blkcg(of_css(of));
+ struct blkcg_gq *blkg;
+ struct blk_iolatency *blkiolat;
+ struct blkg_conf_ctx ctx;
+ struct iolatency_grp *iolat;
+ char *p, *tok;
+ u64 lat_val = 0;
+ u64 oldval;
+ int ret;
+
+ ret = blkg_conf_prep(blkcg, &blkcg_policy_iolatency, buf, &ctx);
+ if (ret)
+ return ret;
+
+ iolat = blkg_to_lat(ctx.blkg);
+ blkiolat = iolat->blkiolat;
+ p = ctx.body;
+
+ ret = -EINVAL;
+ while ((tok = strsep(&p, " "))) {
+ char key[16];
+ char val[21]; /* 18446744073709551616 */
+
+ if (sscanf(tok, "%15[^=]=%20s", key, val) != 2)
+ goto out;
+
+ if (!strcmp(key, "target")) {
+ u64 v;
+
+ if (!strcmp(val, "max"))
+ lat_val = 0;
+ else if (sscanf(val, "%llu", &v) == 1)
+ lat_val = v * NSEC_PER_USEC;
+ else
+ goto out;
+ } else {
+ goto out;
+ }
+ }
+
+ /* Walk up the tree to see if our new val is lower than it should be. */
+ blkg = ctx.blkg;
+ oldval = iolat->min_lat_nsec;
+
+ iolatency_set_min_lat_nsec(blkg, lat_val);
+ if (oldval != iolat->min_lat_nsec)
+ iolatency_clear_scaling(blkg);
+
+ ret = 0;
+out:
+ blkg_conf_finish(&ctx);
+ return ret ?: nbytes;
+}
+
+static u64 iolatency_prfill_limit(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
+{
+ struct iolatency_grp *iolat = pd_to_lat(pd);
+ const char *dname = blkg_dev_name(pd->blkg);
+
+ if (!dname || !iolat->min_lat_nsec)
+ return 0;
+ seq_printf(sf, "%s target=%llu\n",
+ dname,
+ (unsigned long long)iolat->min_lat_nsec / NSEC_PER_USEC);
+ return 0;
+}
+
+static int iolatency_print_limit(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ iolatency_prfill_limit,
+ &blkcg_policy_iolatency, seq_cft(sf)->private, false);
+ return 0;
+}
+
+static size_t iolatency_pd_stat(struct blkg_policy_data *pd, char *buf,
+ size_t size)
+{
+ struct iolatency_grp *iolat = pd_to_lat(pd);
+ struct blkcg_gq *blkg = pd_to_blkg(pd);
+ unsigned long long avg_lat = div64_u64(iolat->total_lat_avg, NSEC_PER_USEC);
+
+ if (!iolat->min_lat_nsec)
+ return 0;
+
+ if (iolat->rq_depth.max_depth == (u64)-1)
+ return snprintf(buf, size, " depth=max avg_lat=%llu", avg_lat);
+
+ return snprintf(buf, size, " depth=%u avg_lat=%llu",
+ iolat->rq_depth.max_depth, avg_lat);
+}
+
+
+static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, int node)
+{
+ struct iolatency_grp *iolat;
+
+ iolat = kzalloc_node(sizeof(*iolat), gfp, node);
+ if (!iolat)
+ return NULL;
+ iolat->stats = __alloc_percpu_gfp(sizeof(struct blk_rq_stat),
+ __alignof__(struct blk_rq_stat), gfp);
+ if (!iolat->stats) {
+ kfree(iolat);
+ return NULL;
+ }
+ return &iolat->pd;
+}
+
+static void iolatency_pd_init(struct blkg_policy_data *pd)
+{
+ struct iolatency_grp *iolat = pd_to_lat(pd);
+ struct blkcg_gq *blkg = lat_to_blkg(iolat);
+ struct rq_qos *rqos = blkcg_rq_qos(blkg->q);
+ struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
+ u64 now = ktime_to_ns(ktime_get());
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct blk_rq_stat *stat;
+ stat = per_cpu_ptr(iolat->stats, cpu);
+ blk_rq_stat_init(stat);
+ }
+
+ rq_wait_init(&iolat->rq_wait);
+ spin_lock_init(&iolat->child_lat.lock);
+ iolat->rq_depth.queue_depth = blk_queue_depth(blkg->q);
+ iolat->rq_depth.max_depth = INT_MAX;
+ iolat->rq_depth.default_depth = iolat->rq_depth.queue_depth;
+ iolat->blkiolat = blkiolat;
+ iolat->cur_win_nsec = 100 * NSEC_PER_MSEC;
+ atomic64_set(&iolat->window_start, now);
+
+ /*
+ * We init things in list order, so the pd for the parent may not be
+ * init'ed yet for whatever reason.
+ */
+ if (blkg->parent && blkg_to_pd(blkg->parent, &blkcg_policy_iolatency)) {
+ struct iolatency_grp *parent = blkg_to_lat(blkg->parent);
+ atomic_set(&iolat->scale_cookie,
+ atomic_read(&parent->child_lat.scale_cookie));
+ } else {
+ atomic_set(&iolat->scale_cookie, DEFAULT_SCALE_COOKIE);
+ }
+
+ atomic_set(&iolat->child_lat.scale_cookie, DEFAULT_SCALE_COOKIE);
+}
+
+static void iolatency_pd_offline(struct blkg_policy_data *pd)
+{
+ struct iolatency_grp *iolat = pd_to_lat(pd);
+ struct blkcg_gq *blkg = lat_to_blkg(iolat);
+
+ iolatency_set_min_lat_nsec(blkg, 0);
+ iolatency_clear_scaling(blkg);
+}
+
+static void iolatency_pd_free(struct blkg_policy_data *pd)
+{
+ struct iolatency_grp *iolat = pd_to_lat(pd);
+ free_percpu(iolat->stats);
+ kfree(iolat);
+}
+
+static struct cftype iolatency_files[] = {
+ {
+ .name = "latency",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = iolatency_print_limit,
+ .write = iolatency_set_limit,
+ },
+ {}
+};
+
+static struct blkcg_policy blkcg_policy_iolatency = {
+ .dfl_cftypes = iolatency_files,
+ .pd_alloc_fn = iolatency_pd_alloc,
+ .pd_init_fn = iolatency_pd_init,
+ .pd_offline_fn = iolatency_pd_offline,
+ .pd_free_fn = iolatency_pd_free,
+ .pd_stat_fn = iolatency_pd_stat,
+};
+
+static int __init iolatency_init(void)
+{
+ return blkcg_policy_register(&blkcg_policy_iolatency);
+}
+
+static void __exit iolatency_exit(void)
+{
+ return blkcg_policy_unregister(&blkcg_policy_iolatency);
+}
+
+module_init(iolatency_init);
+module_exit(iolatency_exit);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6c23404c01db..05f3a8ba564f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -904,6 +904,8 @@ int blk_register_queue(struct gendisk *disk)

wbt_enable_default(q);

+ blk_iolatency_init(q);
+
blk_throtl_register_queue(q);

if (q->request_fn || (q->mq_ops && q->elevator)) {
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..f4752eb0eb56 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -409,4 +409,10 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)

extern void blk_drain_queue(struct request_queue *q);

+#ifdef CONFIG_BLK_CGROUP_IOLATENCY
+extern void blk_iolatency_init(struct request_queue *q);
+#else
+static inline void blk_iolatency_init(struct request_queue *q) { }
+#endif
+
#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4f7ba397d37d..f91f8334000a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -180,9 +180,7 @@ struct bio {
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
struct blkcg_gq *bi_blkg;
-#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
struct bio_issue bi_issue;
-#endif
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.14.3


2018-06-05 13:31:50

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 13/13] Documentation: add a doc for blk-iolatency

From: Josef Bacik <[email protected]>

A basic documentation to describe the interface, statistics, and
behavior of io.latency.

Signed-off-by: Josef Bacik <[email protected]>
---
Documentation/cgroup-v2.txt | 79 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..f6684ec99720 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -51,6 +51,9 @@ v1 is available under Documentation/cgroup-v1/.
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
+ 5-3-3. IO Latency
+ 5-3-3-1. How IO Latency Throttling Works
+ 5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
5-5. Device
@@ -1395,6 +1398,82 @@ writeback as follows.
vm.dirty[_background]_ratio.


+IO Latency
+~~~~~~~~~~
+
+This is a cgroup v2 controller for IO workload protection. You provide a group
+with a latency target, and if the average latency exceeds that target the
+controller will throttle any peers that have a lower latency target than the
+protected workload.
+
+The limits are only applied at the peer level in the hierarchy. This means that
+in the diagram below, only groups A, B, and C will influence each other, and
+groups D and F will influence each other. Group G will influence nobody.
+
+ [root]
+ / | \
+ A B C
+ / \ |
+ D F G
+
+
+So the ideal way to configure this is to set io.latency in groups A, B, and C.
+Generally you do not want to set a value lower than the latency your device
+supports. Experiment to find the value that works best for your workload, start
+at higher than the expected latency for your device and watch the total_lat_avg
+value in io.stat for your workload group to get an idea of the latency you see
+during normal operation. Use this value as a basis for your real setting,
+setting at 10-15% higher than the value in io.stat. Experimentation is key here
+because total_lat_avg is a running total, so is the "statistics" portion of
+"lies, damned lies, and statistics."
+
+How IO Latency Throttling Works
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+io.latency is work conserving, so as long as everybody is meeting their latency
+target the controller doesn't do anything. Once a group starts missing it's
+target it begins throttling any peer group that has a higher target than itself.
+This throttling takes 2 forms:
+
+- Queue depth throttling. This is the number of outstanding IO's a group is
+ allowed to have. We will clamp down relatively quickly, starting at no limit
+ and going all the way down to 1 IO at a time.
+
+- Artificial delay induction. There are certain types of IO that cannot be
+ throttled without possibly adversely affecting higher priority groups. This
+ includes swapping and metadata IO. These types of IO are allowed to occur
+ normally, however they are "charged" to the originating group. If the
+ originating group is being throttled you will see the use_delay and delay
+ fields in io.stat increase. The delay value is how many microseconds that are
+ being added to any process that runs in this group. Because this number can
+ grow quite large if there is a lot of swapping or metadata IO occurring we
+ limit the individual delay events to 1 second at a time.
+
+Once the victimized group starts meeting it's latency target again it will start
+unthrottling any peer groups that were throttled previously. If the victimized
+group simply stops doing IO the global counter will unthrottle appropriately.
+
+IO Latency Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ io.latency
+ This takes a similar format as the other controllers.
+
+ "MAJOR:MINOR target=<target time in microseconds"
+
+ io.stat
+ If the controller is enabled you will see extra stats in io.stat in
+ addition to the normal ones.
+
+ depth
+ This is the current queue depth for the group.
+
+ avg_lat
+ The running average IO latency for this group in microseconds.
+ Running average is generally flawed, but will give an
+ administrator a general idea of the overall latency they can
+ expect for their workload on the given disk.
+
PID
---

--
2.14.3


2018-06-05 13:33:14

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 10/13] block: remove external dependency on wbt_flags

From: Josef Bacik <[email protected]>

We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.

Signed-off-by: Josef Bacik <[email protected]>
---
block/blk-core.c | 9 ++++-----
block/blk-mq.c | 9 ++++-----
block/blk-rq-qos.c | 24 +++++++++++++++---------
block/blk-rq-qos.h | 11 ++++++-----
block/blk-wbt.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
block/blk-wbt.h | 5 -----
6 files changed, 68 insertions(+), 42 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 07594fb34c0f..af67fa68f635 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,7 +42,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
-#include "blk-wbt.h"
+#include "blk-rq-qos.h"

#ifdef CONFIG_DEBUG_FS
struct dentry *blk_debugfs_root;
@@ -1994,7 +1994,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
int where = ELEVATOR_INSERT_SORT;
struct request *req, *free;
unsigned int request_count = 0;
- unsigned int wb_acct;

/*
* low level driver can indicate that it wants pages above a
@@ -2052,7 +2051,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}

get_rq:
- wb_acct = rq_qos_throttle(q, bio, q->queue_lock);
+ rq_qos_throttle(q, bio, q->queue_lock);

/*
* Grab a free request. This is might sleep but can not fail.
@@ -2062,7 +2061,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
req = get_request(q, bio->bi_opf, bio, 0, GFP_NOIO);
if (IS_ERR(req)) {
blk_queue_exit(q);
- rq_qos_cleanup(q, wb_acct);
+ rq_qos_cleanup(q, bio);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
else
@@ -2071,7 +2070,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
goto out_unlock;
}

- wbt_track(req, wb_acct);
+ rq_qos_track(q, req, bio);

/*
* After dropping the lock and possibly sleeping here, our request
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5308396ee22a..7c55c9d35492 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -34,8 +34,8 @@
#include "blk-mq-debugfs.h"
#include "blk-mq-tag.h"
#include "blk-stat.h"
-#include "blk-wbt.h"
#include "blk-mq-sched.h"
+#include "blk-rq-qos.h"

static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
static void blk_mq_poll_stats_start(struct request_queue *q);
@@ -1852,7 +1852,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
- unsigned int wb_acct;

blk_queue_bounce(q, &bio);

@@ -1868,19 +1867,19 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (blk_mq_sched_bio_merge(q, bio))
return BLK_QC_T_NONE;

- wb_acct = rq_qos_throttle(q, bio, NULL);
+ rq_qos_throttle(q, bio, NULL);

trace_block_getrq(q, bio, bio->bi_opf);

rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
- rq_qos_cleanup(q, wb_acct);
+ rq_qos_cleanup(q, bio);
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}

- wbt_track(rq, wb_acct);
+ rq_qos_track(q, rq, bio);

cookie = request_to_qc_t(data.hctx, rq);

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index d2f2af8aa10c..b7b02e04f64f 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -1,7 +1,5 @@
#include "blk-rq-qos.h"

-#include "blk-wbt.h"
-
/*
* Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
* false if 'v' + 1 would be bigger than 'below'.
@@ -29,13 +27,13 @@ bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
return atomic_inc_below(&rq_wait->inflight, limit);
}

-void rq_qos_cleanup(struct request_queue *q, enum wbt_flags wb_acct)
+void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
{
struct rq_qos *rqos;

for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
if (rqos->ops->cleanup)
- rqos->ops->cleanup(rqos, wb_acct);
+ rqos->ops->cleanup(rqos, bio);
}
}

@@ -69,17 +67,25 @@ void rq_qos_requeue(struct request_queue *q, struct request *rq)
}
}

-enum wbt_flags rq_qos_throttle(struct request_queue *q, struct bio *bio,
- spinlock_t *lock)
+void rq_qos_throttle(struct request_queue *q, struct bio *bio,
+ spinlock_t *lock)
{
struct rq_qos *rqos;
- enum wbt_flags flags = 0;

for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
if (rqos->ops->throttle)
- flags |= rqos->ops->throttle(rqos, bio, lock);
+ rqos->ops->throttle(rqos, bio, lock);
+ }
+}
+
+void rq_qos_track(struct request_queue *q, struct request *rq, struct bio *bio)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->track)
+ rqos->ops->track(rqos, rq, bio);
}
- return flags;
}

/*
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index a321e9e2ba70..501253676dd8 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -25,12 +25,12 @@ struct rq_qos {
};

struct rq_qos_ops {
- enum wbt_flags (*throttle)(struct rq_qos *, struct bio *,
- spinlock_t *);
+ void (*throttle)(struct rq_qos *, struct bio *, spinlock_t *);
+ void (*track)(struct rq_qos *, struct request *, struct bio *);
void (*issue)(struct rq_qos *, struct request *);
void (*requeue)(struct rq_qos *, struct request *);
void (*done)(struct rq_qos *, struct request *);
- void (*cleanup)(struct rq_qos *, enum wbt_flags);
+ void (*cleanup)(struct rq_qos *, struct bio *);
void (*exit)(struct rq_qos *);
};

@@ -82,10 +82,11 @@ void rq_depth_scale_up(struct rq_depth *rqd);
void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle);
bool rq_depth_calc_max_depth(struct rq_depth *rqd);

-void rq_qos_cleanup(struct request_queue *, enum wbt_flags);
+void rq_qos_cleanup(struct request_queue *, struct bio *);
void rq_qos_done(struct request_queue *, struct request *);
void rq_qos_issue(struct request_queue *, struct request *);
void rq_qos_requeue(struct request_queue *, struct request *);
-enum wbt_flags rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
+void rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
+void rq_qos_track(struct request_queue *q, struct request *, struct bio *);
void rq_qos_exit(struct request_queue *);
#endif
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 6fe20fb823e4..461a9af11efe 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -549,41 +549,66 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
}
}

+static enum wbt_flags bio_to_wbt_flags(struct rq_wb *rwb, struct bio *bio)
+{
+ enum wbt_flags flags = 0;
+
+ if (bio_op(bio) == REQ_OP_READ) {
+ flags = WBT_READ;
+ } else if (wbt_should_throttle(rwb, bio)) {
+ if (current_is_kswapd())
+ flags |= WBT_KSWAPD;
+ if (bio_op(bio) == REQ_OP_DISCARD)
+ flags |= WBT_DISCARD;
+ flags |= WBT_TRACKED;
+ }
+ return flags;
+}
+
+static void wbt_cleanup(struct rq_qos *rqos, struct bio *bio)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ enum wbt_flags flags = bio_to_wbt_flags(rwb, bio);
+ __wbt_done(rqos, flags);
+}
+
/*
* Returns true if the IO request should be accounted, false if not.
* May sleep, if we have exceeded the writeback limits. Caller can pass
* in an irq held spinlock, if it holds one when calling this function.
* If we do sleep, we'll release and re-grab it.
*/
-static enum wbt_flags wbt_wait(struct rq_qos *rqos, struct bio *bio,
- spinlock_t *lock)
+static void wbt_wait(struct rq_qos *rqos, struct bio *bio, spinlock_t *lock)
{
struct rq_wb *rwb = RQWB(rqos);
- enum wbt_flags ret = 0;
+ enum wbt_flags flags;

if (!rwb_enabled(rwb))
- return 0;
+ return;

- if (bio_op(bio) == REQ_OP_READ)
- ret = WBT_READ;
+ flags = bio_to_wbt_flags(rwb, bio);

if (!wbt_should_throttle(rwb, bio)) {
- if (ret & WBT_READ)
+ if (flags & WBT_READ)
wb_timestamp(rwb, &rwb->last_issue);
- return ret;
+ return;
}

if (current_is_kswapd())
- ret |= WBT_KSWAPD;
+ flags |= WBT_KSWAPD;
if (bio_op(bio) == REQ_OP_DISCARD)
- ret |= WBT_DISCARD;
+ flags |= WBT_DISCARD;

- __wbt_wait(rwb, ret, bio->bi_opf, lock);
+ __wbt_wait(rwb, flags, bio->bi_opf, lock);

if (!blk_stat_is_active(rwb->cb))
rwb_arm_timer(rwb);
+}

- return ret | WBT_TRACKED;
+static void wbt_track(struct rq_qos *rqos, struct request *rq, struct bio *bio)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
}

void wbt_issue(struct rq_qos *rqos, struct request *rq)
@@ -707,9 +732,10 @@ EXPORT_SYMBOL_GPL(wbt_disable_default);
static struct rq_qos_ops wbt_rqos_ops = {
.throttle = wbt_wait,
.issue = wbt_issue,
+ .track = wbt_track,
.requeue = wbt_requeue,
.done = wbt_done,
- .cleanup = __wbt_done,
+ .cleanup = wbt_cleanup,
.exit = wbt_exit,
};

diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 53b20a58c0a2..f47218d5b3b2 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -87,11 +87,6 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)

#ifdef CONFIG_BLK_WBT

-static inline void wbt_track(struct request *rq, enum wbt_flags flags)
-{
- rq->wbt_flags |= flags;
-}
-
int wbt_init(struct request_queue *);
void wbt_update_limits(struct request_queue *);
void wbt_disable_default(struct request_queue *);
--
2.14.3


2018-06-05 13:33:25

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 06/13] blkcg: add generic throttling mechanism

From: Josef Bacik <[email protected]>

Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies. The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources. Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP. We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik <[email protected]>
---
block/blk-cgroup.c | 207 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/blk-cgroup.h | 72 +++++++++++++++
include/linux/cgroup-defs.h | 3 +
include/linux/sched.h | 8 ++
include/linux/tracehook.h | 2 +
5 files changed, 292 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b2ac6a7f851f..175a6d8f593d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,6 +27,7 @@
#include <linux/atomic.h>
#include <linux/ctype.h>
#include <linux/blk-cgroup.h>
+#include <linux/tracehook.h>
#include "blk.h"

#define MAX_KEY_LEN 100
@@ -1340,6 +1341,13 @@ static void blkcg_bind(struct cgroup_subsys_state *root_css)
mutex_unlock(&blkcg_pol_mutex);
}

+static void blkcg_exit(struct task_struct *tsk)
+{
+ if (tsk->throttle_queue)
+ blk_put_queue(tsk->throttle_queue);
+ tsk->throttle_queue = NULL;
+}
+
struct cgroup_subsys io_cgrp_subsys = {
.css_alloc = blkcg_css_alloc,
.css_offline = blkcg_css_offline,
@@ -1349,6 +1357,7 @@ struct cgroup_subsys io_cgrp_subsys = {
.dfl_cftypes = blkcg_files,
.legacy_cftypes = blkcg_legacy_files,
.legacy_name = "blkio",
+ .exit = blkcg_exit,
#ifdef CONFIG_MEMCG
/*
* This ensures that, if available, memcg is automatically enabled
@@ -1600,5 +1609,203 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
}
EXPORT_SYMBOL_GPL(blkcg_policy_unregister);

+/*
+ * Scale the accumulated delay based on how long it has been since we updated
+ * the delay. We only call this when we are adding delay, in case it's been a
+ * while since we added delay, and when we are checking to see if we need to
+ * delay a task, to account for any delays that may have occurred.
+ */
+static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 now)
+{
+ u64 old = atomic64_read(&blkg->delay_start);
+
+ /*
+ * We only want to scale down every second. The idea here is that we
+ * want to delay people for min(delay_nsec, NSEC_PER_SEC) in a certain
+ * time window. We only want to throttle tasks for recent delay that
+ * has occurred, in 1 second time windows since that's the maximum
+ * things can be throttled. We save the current delay window in
+ * blkg->last_delay so we know what amount is still left to be charged
+ * to the blkg from this point onward. blkg->last_use keeps track of
+ * the use_delay counter. The idea is if we're unthrottling the blkg we
+ * are ok with whatever is happening now, and we can take away more of
+ * the accumulated delay as we've already throttled enough that
+ * everybody is happy with their IO latencies.
+ */
+ if (time_before64(old + NSEC_PER_SEC, now) &&
+ atomic64_cmpxchg(&blkg->delay_start, old, now) == old) {
+ u64 cur = atomic64_read(&blkg->delay_nsec);
+ u64 sub = min_t(u64, blkg->last_delay, now - old);
+ int cur_use = atomic_read(&blkg->use_delay);
+
+ /*
+ * We've been unthrottled, subtract a larger chunk of our
+ * accumulated delay.
+ */
+ if (cur_use < blkg->last_use)
+ sub = max_t(u64, sub, blkg->last_delay >> 1);
+
+ /*
+ * This shouldn't happen, but handle it anyway. Our delay_nsec
+ * should only ever be growing except here where we subtract out
+ * min(last_delay, 1 second), but lord knows bugs happen and I'd
+ * rather not end up with negative numbers.
+ */
+ if (unlikely(cur < sub)) {
+ atomic64_set(&blkg->delay_nsec, 0);
+ blkg->last_delay = 0;
+ } else {
+ atomic64_sub(sub, &blkg->delay_nsec);
+ blkg->last_delay = cur - sub;
+ }
+ blkg->last_use = cur_use;
+ }
+}
+
+/*
+ * This is called when we want to actually walk up the hierarchy and check to
+ * see if we need to throttle, and then actually throttle if there is some
+ * accumulated delay. This should only be called upon return to user space so
+ * we're not holding some lock that would induce a priority inversion.
+ */
+static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
+{
+ u64 now = ktime_to_ns(ktime_get());
+ u64 exp;
+ u64 delay_nsec = 0;
+ int tok;
+
+ while (blkg->parent) {
+ if (atomic_read(&blkg->use_delay)) {
+ blkcg_scale_delay(blkg, now);
+ delay_nsec = max_t(u64, delay_nsec,
+ atomic64_read(&blkg->delay_nsec));
+ }
+ blkg = blkg->parent;
+ }
+
+ if (!delay_nsec)
+ return;
+
+ /*
+ * Let's not sleep for all eternity if we've amassed a huge delay.
+ * Swapping or metadata IO can accumulate 10's of seconds worth of
+ * delay, and we want userspace to be able to do _something_ so cap the
+ * delays at 1 second. If there's 10's of seconds worth of delay then
+ * the tasks will be delayed for 1 second for every syscall.
+ */
+ delay_nsec = min_t(u64, delay_nsec, NSEC_PER_SEC);
+
+ /*
+ * TODO: the use_memdelay flag is going to be for the upcoming psi stuff
+ * that hasn't landed upstream yet. Once that stuff is in place we need
+ * to do a psi_memstall_enter/leave if memdelay is set.
+ */
+
+ exp = ktime_add_ns(now, delay_nsec);
+ tok = io_schedule_prepare();
+ do {
+ __set_current_state(TASK_KILLABLE);
+ if (!schedule_hrtimeout(&exp, HRTIMER_MODE_ABS))
+ break;
+ } while (!fatal_signal_pending(current));
+ io_schedule_finish(tok);
+}
+
+/**
+ * blkcg_maybe_throttle_current - throttle the current task if it has been marked
+ *
+ * This is only called if we've been marked with set_notify_resume(). Obviously
+ * we can be set_notify_resume() for reasons other than blkcg throttling, so we
+ * check to see if current->throttle_queue is set and if not this doesn't do
+ * anything. This should only ever be called by the resume code, it's not meant
+ * to be called by people willy-nilly as it will actually do the work to
+ * throttle the task if it is setup for throttling.
+ */
+void blkcg_maybe_throttle_current(void)
+{
+ struct request_queue *q = current->throttle_queue;
+ struct cgroup_subsys_state *css;
+ struct blkcg *blkcg;
+ struct blkcg_gq *blkg;
+ bool use_memdelay = current->use_memdelay;
+
+ if (!q)
+ return;
+
+ current->throttle_queue = NULL;
+ current->use_memdelay = false;
+
+ rcu_read_lock();
+ css = kthread_blkcg();
+ if (css)
+ blkcg = css_to_blkcg(css);
+ else
+ blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+
+ if (!blkcg)
+ goto out;
+ blkg = blkg_lookup(blkcg, q);
+ if (!blkg)
+ goto out;
+ blkg = blkg_try_get(blkg);
+ if (!blkg)
+ goto out;
+ rcu_read_unlock();
+ blk_put_queue(q);
+
+ blkcg_maybe_throttle_blkg(blkg, use_memdelay);
+ blkg_put(blkg);
+ return;
+out:
+ rcu_read_unlock();
+ blk_put_queue(q);
+}
+EXPORT_SYMBOL_GPL(blkcg_maybe_throttle_current);
+
+/**
+ * blkcg_schedule_throttle - this task needs to check for throttling
+ * @q - the request queue IO was submitted on
+ * @use_memdelay - do we charge this to memory delay for PSI
+ *
+ * This is called by the IO controller when we know there's delay accumulated
+ * for the blkg for this task. We do not pass the blkg because there are places
+ * we call this that may not have that information, the swapping code for
+ * instance will only have a request_queue at that point. This set's the
+ * notify_resume for the task to check and see if it requires throttling before
+ * returning to user space.
+ */
+void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
+{
+ if (unlikely(current->flags & PF_KTHREAD))
+ return;
+
+ if (!blk_get_queue(q))
+ return;
+
+ if (current->throttle_queue)
+ blk_put_queue(current->throttle_queue);
+ current->throttle_queue = q;
+ if (use_memdelay)
+ current->use_memdelay = use_memdelay;
+ set_notify_resume(current);
+}
+EXPORT_SYMBOL_GPL(blkcg_schedule_throttle);
+
+/**
+ * blkcg_add_delay - add delay to this blkg
+ * @now - the current time in nanoseconds
+ * @delta - how many nanoseconds of delay to add
+ *
+ * Charge @delta to the blkg's current delay accumulation. This is used to
+ * throttle tasks if an IO controller thinks we need more throttling.
+ */
+void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
+{
+ blkcg_scale_delay(blkg, now);
+ atomic64_add(delta, &blkg->delay_nsec);
+}
+EXPORT_SYMBOL_GPL(blkcg_add_delay);
+
module_param(blkcg_debug_stats, bool, 0444);
MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index a8f9ba8f33a4..327d00367c6b 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -136,6 +136,12 @@ struct blkcg_gq {
struct blkg_policy_data *pd[BLKCG_MAX_POLS];

struct rcu_head rcu_head;
+
+ atomic_t use_delay;
+ atomic64_t delay_nsec;
+ atomic64_t delay_start;
+ u64 last_delay;
+ int last_use;
};

typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp);
@@ -374,6 +380,21 @@ static inline void blkg_get(struct blkcg_gq *blkg)
atomic_inc(&blkg->refcnt);
}

+/**
+ * blkg_try_get - try and get a blkg reference
+ * @blkg: blkg to get
+ *
+ * This is for use when doing an RCU lookup of the blkg. We may be in the midst
+ * of freeing this blkg, so we can only use it if the refcnt is not zero.
+ */
+static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg)
+{
+ if (atomic_inc_not_zero(&blkg->refcnt))
+ return blkg;
+ return NULL;
+}
+
+
void __blkg_release_rcu(struct rcu_head *rcu);

/**
@@ -734,6 +755,53 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
return !throtl;
}

+static inline void blkcg_use_delay(struct blkcg_gq *blkg)
+{
+ if (atomic_inc_and_test(&blkg->use_delay))
+ atomic_inc(&blkg->blkcg->css.cgroup->congestion_count);
+}
+
+static inline int blkcg_unuse_delay(struct blkcg_gq *blkg)
+{
+ int old = atomic_read(&blkg->use_delay);
+
+ if (old == 0)
+ return 0;
+
+ /*
+ * We do this song and dance because we can race with somebody else
+ * adding or removing delay. If we just did an atomic_dec we'd end up
+ * negative and we'd already be in trouble. We need to subtract 1 and
+ * then check to see if we were the last delay so we can drop the
+ * congestion count on the cgroup.
+ */
+ while (old) {
+ int cur = atomic_cmpxchg(&blkg->use_delay, old, old - 1);
+ if (cur == old)
+ break;
+ cur = old;
+ }
+
+ if (old == 0)
+ return 0;
+ if (old == 1)
+ atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
+ return 1;
+}
+
+static inline void blkcg_clear_delay(struct blkcg_gq *blkg)
+{
+ int old = atomic_read(&blkg->use_delay);
+ if (!old)
+ return;
+ /* We only want 1 person clearing the congestion count for this blkg. */
+ if (atomic_cmpxchg(&blkg->use_delay, old, 0) == old)
+ atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
+}
+
+void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
+void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
+void blkcg_maybe_throttle_current(void);
#else /* CONFIG_BLK_CGROUP */

struct blkcg {
@@ -753,8 +821,12 @@ struct blkcg_policy {

#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))

+static inline void blkcg_maybe_throttle_current(void) { }
+
#ifdef CONFIG_BLOCK

+static inline void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay) { }
+
static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
static inline void blkcg_drain_queue(struct request_queue *q) { }
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..b3ab17d53f3d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -427,6 +427,9 @@ struct cgroup {
/* used to store eBPF programs */
struct cgroup_bpf bpf;

+ /* If there is block congestion on this cgroup. */
+ atomic_t congestion_count;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
};
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d697f3b573..6fe0d3c1409b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -693,6 +693,10 @@ struct task_struct {
/* disallow userland-initiated cgroup migration */
unsigned no_cgroup_migration:1;
#endif
+#ifdef CONFIG_BLK_CGROUP
+ /* to be used once the psi infrastructure lands upstream. */
+ unsigned use_memdelay:1;
+#endif

unsigned long atomic_flags; /* Flags requiring atomic access. */

@@ -1099,6 +1103,10 @@ struct task_struct {
unsigned int memcg_nr_pages_over_high;
#endif

+#ifdef CONFIG_BLK_CGROUP
+ struct request_queue *throttle_queue;
+#endif
+
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
#endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 26c152122a42..4e24930306b9 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -51,6 +51,7 @@
#include <linux/security.h>
#include <linux/task_work.h>
#include <linux/memcontrol.h>
+#include <linux/blk-cgroup.h>
struct linux_binprm;

/*
@@ -191,6 +192,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
task_work_run();

mem_cgroup_handle_over_high();
+ blkcg_maybe_throttle_current();
}

#endif /* <linux/tracehook.h> */
--
2.14.3


2018-06-05 13:33:35

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 11/13] rq-qos: introduce dio_bio callback

From: Josef Bacik <[email protected]>

wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.

Signed-off-by: Josef Bacik <[email protected]>
---
block/bio.c | 4 ++++
block/blk-rq-qos.c | 10 ++++++++++
block/blk-rq-qos.h | 2 ++
3 files changed, 16 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index c77fe1b4caa8..99809b73a400 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -32,6 +32,7 @@

#include <trace/events/block.h>
#include "blk.h"
+#include "blk-rq-qos.h"

/*
* Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1781,6 +1782,9 @@ void bio_endio(struct bio *bio)
if (WARN_ONCE(bio->bi_next, "driver left bi_next not NULL"))
bio->bi_next = NULL;

+ if (bio->bi_disk)
+ rq_qos_done_bio(bio->bi_disk->queue, bio);
+
/*
* Need to have a real endio function for chained bios, otherwise
* various corner cases will break (like stacking block devices that
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index b7b02e04f64f..5134b24482f6 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -88,6 +88,16 @@ void rq_qos_track(struct request_queue *q, struct request *rq, struct bio *bio)
}
}

+void rq_qos_done_bio(struct request_queue *q, struct bio *bio)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->done_bio)
+ rqos->ops->done_bio(rqos, bio);
+ }
+}
+
/*
* Return true, if we can't increase the depth further by scaling
*/
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 501253676dd8..3874e6ad09f0 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -30,6 +30,7 @@ struct rq_qos_ops {
void (*issue)(struct rq_qos *, struct request *);
void (*requeue)(struct rq_qos *, struct request *);
void (*done)(struct rq_qos *, struct request *);
+ void (*done_bio)(struct rq_qos *, struct bio *);
void (*cleanup)(struct rq_qos *, struct bio *);
void (*exit)(struct rq_qos *);
};
@@ -86,6 +87,7 @@ void rq_qos_cleanup(struct request_queue *, struct bio *);
void rq_qos_done(struct request_queue *, struct request *);
void rq_qos_issue(struct request_queue *, struct request *);
void rq_qos_requeue(struct request_queue *, struct request *);
+void rq_qos_done_bio(struct request_queue *q, struct bio *bio);
void rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
void rq_qos_track(struct request_queue *q, struct request *, struct bio *);
void rq_qos_exit(struct request_queue *);
--
2.14.3


2018-06-05 13:33:48

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 09/13] blk-rq-qos: refactor out common elements of blk-wbt

From: Josef Bacik <[email protected]>

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 12 +-
block/blk-mq.c | 12 +-
block/blk-rq-qos.c | 178 +++++++++++++++++++++++++++
block/blk-rq-qos.h | 91 ++++++++++++++
block/blk-settings.c | 4 +-
block/blk-sysfs.c | 22 ++--
block/blk-wbt.c | 326 ++++++++++++++++++++++---------------------------
block/blk-wbt.h | 63 ++++------
include/linux/blkdev.h | 4 +-
10 files changed, 463 insertions(+), 251 deletions(-)
create mode 100644 block/blk-rq-qos.c
create mode 100644 block/blk-rq-qos.h

diff --git a/block/Makefile b/block/Makefile
index 6a56303b9925..981605042cad 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o partition-generic.o ioprio.o \
- badblocks.o partitions/
+ badblocks.o partitions/ blk-rq-qos.o

obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..07594fb34c0f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1653,7 +1653,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
blk_delete_timer(rq);
blk_clear_rq_complete(rq);
trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);

if (rq->rq_flags & RQF_QUEUED)
blk_queue_end_tag(q, rq);
@@ -1760,7 +1760,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
/* this is a bio leak */
WARN_ON(req->bio != NULL);

- wbt_done(q->rq_wb, req);
+ rq_qos_done(q, req);

/*
* Request may not have originated from ll_rw_blk. if not,
@@ -2052,7 +2052,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}

get_rq:
- wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
+ wb_acct = rq_qos_throttle(q, bio, q->queue_lock);

/*
* Grab a free request. This is might sleep but can not fail.
@@ -2062,7 +2062,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
req = get_request(q, bio->bi_opf, bio, 0, GFP_NOIO);
if (IS_ERR(req)) {
blk_queue_exit(q);
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
else
@@ -2989,7 +2989,7 @@ void blk_start_request(struct request *req)
req->throtl_size = blk_rq_sectors(req);
#endif
req->rq_flags |= RQF_STATS;
- wbt_issue(req->q->rq_wb, req);
+ rq_qos_issue(req->q, req);
}

BUG_ON(blk_rq_is_complete(req));
@@ -3211,7 +3211,7 @@ void blk_finish_request(struct request *req, blk_status_t error)
blk_account_io_done(req, now);

if (req->end_io) {
- wbt_done(req->q->rq_wb, req);
+ rq_qos_done(q, req);
req->end_io(req, error);
} else {
if (blk_bidi_rq(req))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df928200b17e..5308396ee22a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -489,7 +489,7 @@ void blk_mq_free_request(struct request *rq)
if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
laptop_io_completion(q->backing_dev_info);

- wbt_done(q->rq_wb, rq);
+ rq_qos_done(q, rq);

if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
@@ -516,7 +516,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
blk_account_io_done(rq, now);

if (rq->end_io) {
- wbt_done(rq->q->rq_wb, rq);
+ rq_qos_done(rq->q, rq);
rq->end_io(rq, error);
} else {
if (unlikely(blk_bidi_rq(rq)))
@@ -678,7 +678,7 @@ void blk_mq_start_request(struct request *rq)
rq->throtl_size = blk_rq_sectors(rq);
#endif
rq->rq_flags |= RQF_STATS;
- wbt_issue(q->rq_wb, rq);
+ rq_qos_issue(q, rq);
}

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
@@ -726,7 +726,7 @@ static void __blk_mq_requeue_request(struct request *rq)
blk_mq_put_driver_tag(rq);

trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);

if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
@@ -1868,13 +1868,13 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (blk_mq_sched_bio_merge(q, bio))
return BLK_QC_T_NONE;

- wb_acct = wbt_wait(q->rq_wb, bio, NULL);
+ wb_acct = rq_qos_throttle(q, bio, NULL);

trace_block_getrq(q, bio, bio->bi_opf);

rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
new file mode 100644
index 000000000000..d2f2af8aa10c
--- /dev/null
+++ b/block/blk-rq-qos.c
@@ -0,0 +1,178 @@
+#include "blk-rq-qos.h"
+
+#include "blk-wbt.h"
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
+{
+ return atomic_inc_below(&rq_wait->inflight, limit);
+}
+
+void rq_qos_cleanup(struct request_queue *q, enum wbt_flags wb_acct)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->cleanup)
+ rqos->ops->cleanup(rqos, wb_acct);
+ }
+}
+
+void rq_qos_done(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->done)
+ rqos->ops->done(rqos, rq);
+ }
+}
+
+void rq_qos_issue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->issue)
+ rqos->ops->issue(rqos, rq);
+ }
+}
+
+void rq_qos_requeue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->requeue)
+ rqos->ops->requeue(rqos, rq);
+ }
+}
+
+enum wbt_flags rq_qos_throttle(struct request_queue *q, struct bio *bio,
+ spinlock_t *lock)
+{
+ struct rq_qos *rqos;
+ enum wbt_flags flags = 0;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->throttle)
+ flags |= rqos->ops->throttle(rqos, bio, lock);
+ }
+ return flags;
+}
+
+/*
+ * Return true, if we can't increase the depth further by scaling
+ */
+bool rq_depth_calc_max_depth(struct rq_depth *rqd)
+{
+ unsigned int depth;
+ bool ret = false;
+
+ /*
+ * For QD=1 devices, this is a special case. It's important for those
+ * to have one request ready when one completes, so force a depth of
+ * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+ * since the device can't have more than that in flight. If we're
+ * scaling down, then keep a setting of 1/1/1.
+ */
+ if (rqd->queue_depth == 1) {
+ if (rqd->scale_step > 0)
+ rqd->max_depth = 1;
+ else {
+ rqd->max_depth = 2;
+ ret = true;
+ }
+ } else {
+ /*
+ * scale_step == 0 is our default state. If we have suffered
+ * latency spikes, step will be > 0, and we shrink the
+ * allowed write depths. If step is < 0, we're only doing
+ * writes, and we allow a temporarily higher depth to
+ * increase performance.
+ */
+ depth = min_t(unsigned int, rqd->default_depth,
+ rqd->queue_depth);
+ if (rqd->scale_step > 0)
+ depth = 1 + ((depth - 1) >> min(31, rqd->scale_step));
+ else if (rqd->scale_step < 0) {
+ unsigned int maxd = 3 * rqd->queue_depth / 4;
+
+ depth = 1 + ((depth - 1) << -rqd->scale_step);
+ if (depth > maxd) {
+ depth = maxd;
+ ret = true;
+ }
+ }
+
+ rqd->max_depth = depth;
+ }
+
+ return ret;
+}
+
+void rq_depth_scale_up(struct rq_depth *rqd)
+{
+ /*
+ * Hit max in previous round, stop here
+ */
+ if (rqd->scaled_max)
+ return;
+
+ rqd->scale_step--;
+
+ rqd->scaled_max = rq_depth_calc_max_depth(rqd);
+}
+
+/*
+ * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
+ * had a latency violation.
+ */
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle)
+{
+ /*
+ * Stop scaling down when we've hit the limit. This also prevents
+ * ->scale_step from going to crazy values, if the device can't
+ * keep up.
+ */
+ if (rqd->max_depth == 1)
+ return;
+
+ if (rqd->scale_step < 0 && hard_throttle)
+ rqd->scale_step = 0;
+ else
+ rqd->scale_step++;
+
+ rqd->scaled_max = false;
+ rq_depth_calc_max_depth(rqd);
+}
+
+void rq_qos_exit(struct request_queue *q)
+{
+ while (q->rq_qos) {
+ struct rq_qos *rqos = q->rq_qos;
+ q->rq_qos = rqos->next;
+ rqos->ops->exit(rqos);
+ }
+}
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
new file mode 100644
index 000000000000..a321e9e2ba70
--- /dev/null
+++ b/block/blk-rq-qos.h
@@ -0,0 +1,91 @@
+#ifndef RQ_QOS_H
+#define RQ_QOS_H
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk_types.h>
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+enum rq_qos_id {
+ RQ_QOS_WBT,
+ RQ_QOS_CGROUP,
+};
+
+struct rq_wait {
+ wait_queue_head_t wait;
+ atomic_t inflight;
+};
+
+struct rq_qos {
+ struct rq_qos_ops *ops;
+ struct request_queue *q;
+ enum rq_qos_id id;
+ struct rq_qos *next;
+};
+
+struct rq_qos_ops {
+ enum wbt_flags (*throttle)(struct rq_qos *, struct bio *,
+ spinlock_t *);
+ void (*issue)(struct rq_qos *, struct request *);
+ void (*requeue)(struct rq_qos *, struct request *);
+ void (*done)(struct rq_qos *, struct request *);
+ void (*cleanup)(struct rq_qos *, enum wbt_flags);
+ void (*exit)(struct rq_qos *);
+};
+
+struct rq_depth {
+ unsigned int max_depth;
+
+ int scale_step;
+ bool scaled_max;
+
+ unsigned int queue_depth;
+ unsigned int default_depth;
+};
+
+static inline struct rq_qos *rq_qos_id(struct request_queue *q,
+ enum rq_qos_id id)
+{
+ struct rq_qos *rqos;
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->id == id)
+ break;
+ }
+ return rqos;
+}
+
+static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_WBT);
+}
+
+static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_CGROUP);
+}
+
+static inline void rq_wait_init(struct rq_wait *rq_wait)
+{
+ atomic_set(&rq_wait->inflight, 0);
+ init_waitqueue_head(&rq_wait->wait);
+}
+
+static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
+{
+ rqos->next = q->rq_qos;
+ q->rq_qos = rqos;
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit);
+void rq_depth_scale_up(struct rq_depth *rqd);
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle);
+bool rq_depth_calc_max_depth(struct rq_depth *rqd);
+
+void rq_qos_cleanup(struct request_queue *, enum wbt_flags);
+void rq_qos_done(struct request_queue *, struct request *);
+void rq_qos_issue(struct request_queue *, struct request *);
+void rq_qos_requeue(struct request_queue *, struct request *);
+enum wbt_flags rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
+void rq_qos_exit(struct request_queue *);
+#endif
diff --git a/block/blk-settings.c b/block/blk-settings.c
index d1de71124656..053de87d1fda 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -875,7 +875,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
q->queue_depth = depth;
- wbt_set_queue_depth(q->rq_wb, depth);
+ wbt_set_queue_depth(q, depth);
}
EXPORT_SYMBOL(blk_set_queue_depth);

@@ -900,7 +900,7 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
queue_flag_clear(QUEUE_FLAG_FUA, q);
spin_unlock_irq(q->queue_lock);

- wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
}
EXPORT_SYMBOL_GPL(blk_queue_write_cache);

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cae525b7aae6..6c23404c01db 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -422,16 +422,16 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,

static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
{
- if (!q->rq_wb)
+ if (!wbt_rq_qos(q))
return -EINVAL;

- return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+ return sprintf(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
}

static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
size_t count)
{
- struct rq_wb *rwb;
+ struct rq_qos *rqos;
ssize_t ret;
s64 val;

@@ -441,23 +441,21 @@ static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
if (val < -1)
return -EINVAL;

- rwb = q->rq_wb;
- if (!rwb) {
+ rqos = wbt_rq_qos(q);
+ if (!rqos) {
ret = wbt_init(q);
if (ret)
return ret;
}

- rwb = q->rq_wb;
if (val == -1)
- rwb->min_lat_nsec = wbt_default_latency_nsec(q);
+ val = wbt_default_latency_nsec(q);
else if (val >= 0)
- rwb->min_lat_nsec = val * 1000ULL;
+ val *= 1000ULL;

- if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
- rwb->enable_state = WBT_STATE_ON_MANUAL;
+ wbt_set_min_lat(q, val);

- wbt_update_limits(rwb);
+ wbt_update_limits(q);
return count;
}

@@ -965,7 +963,7 @@ void blk_unregister_queue(struct gendisk *disk)
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));

- wbt_exit(q);
+ rq_qos_exit(q);

mutex_lock(&q->sysfs_lock);
if (q->request_fn || (q->mq_ops && q->elevator))
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4f89b28fa652..6fe20fb823e4 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -25,6 +25,7 @@
#include <linux/swap.h>

#include "blk-wbt.h"
+#include "blk-rq-qos.h"

#define CREATE_TRACE_POINTS
#include <trace/events/wbt.h>
@@ -78,28 +79,6 @@ static inline bool rwb_enabled(struct rq_wb *rwb)
return rwb && rwb->wb_normal != 0;
}

-/*
- * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
- * false if 'v' + 1 would be bigger than 'below'.
- */
-static bool atomic_inc_below(atomic_t *v, int below)
-{
- int cur = atomic_read(v);
-
- for (;;) {
- int old;
-
- if (cur >= below)
- return false;
- old = atomic_cmpxchg(v, cur, cur + 1);
- if (old == cur)
- break;
- cur = old;
- }
-
- return true;
-}
-
static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
{
if (rwb_enabled(rwb)) {
@@ -116,7 +95,7 @@ static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
*/
static bool wb_recent_wait(struct rq_wb *rwb)
{
- struct bdi_writeback *wb = &rwb->queue->backing_dev_info->wb;
+ struct bdi_writeback *wb = &rwb->rqos.q->backing_dev_info->wb;

return time_before(jiffies, wb->dirty_sleep + HZ);
}
@@ -144,8 +123,9 @@ static void rwb_wake_all(struct rq_wb *rwb)
}
}

-void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
{
+ struct rq_wb *rwb = RQWB(rqos);
struct rq_wait *rqw;
int inflight, limit;

@@ -194,10 +174,9 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
* Called on completion of a request. Note that it's also called when
* a request is merged, when the request gets freed.
*/
-void wbt_done(struct rq_wb *rwb, struct request *rq)
+static void wbt_done(struct rq_qos *rqos, struct request *rq)
{
- if (!rwb)
- return;
+ struct rq_wb *rwb = RQWB(rqos);

if (!wbt_is_tracked(rq)) {
if (rwb->sync_cookie == rq) {
@@ -209,72 +188,11 @@ void wbt_done(struct rq_wb *rwb, struct request *rq)
wb_timestamp(rwb, &rwb->last_comp);
} else {
WARN_ON_ONCE(rq == rwb->sync_cookie);
- __wbt_done(rwb, wbt_flags(rq));
+ __wbt_done(rqos, wbt_flags(rq));
}
wbt_clear_state(rq);
}

-/*
- * Return true, if we can't increase the depth further by scaling
- */
-static bool calc_wb_limits(struct rq_wb *rwb)
-{
- unsigned int depth;
- bool ret = false;
-
- if (!rwb->min_lat_nsec) {
- rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
- return false;
- }
-
- /*
- * For QD=1 devices, this is a special case. It's important for those
- * to have one request ready when one completes, so force a depth of
- * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
- * since the device can't have more than that in flight. If we're
- * scaling down, then keep a setting of 1/1/1.
- */
- if (rwb->queue_depth == 1) {
- if (rwb->scale_step > 0)
- rwb->wb_max = rwb->wb_normal = 1;
- else {
- rwb->wb_max = rwb->wb_normal = 2;
- ret = true;
- }
- rwb->wb_background = 1;
- } else {
- /*
- * scale_step == 0 is our default state. If we have suffered
- * latency spikes, step will be > 0, and we shrink the
- * allowed write depths. If step is < 0, we're only doing
- * writes, and we allow a temporarily higher depth to
- * increase performance.
- */
- depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
- if (rwb->scale_step > 0)
- depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
- else if (rwb->scale_step < 0) {
- unsigned int maxd = 3 * rwb->queue_depth / 4;
-
- depth = 1 + ((depth - 1) << -rwb->scale_step);
- if (depth > maxd) {
- depth = maxd;
- ret = true;
- }
- }
-
- /*
- * Set our max/normal/bg queue depths based on how far
- * we have scaled down (->scale_step).
- */
- rwb->wb_max = depth;
- rwb->wb_normal = (rwb->wb_max + 1) / 2;
- rwb->wb_background = (rwb->wb_max + 3) / 4;
- }
-
- return ret;
-}
-
static inline bool stat_sample_valid(struct blk_rq_stat *stat)
{
/*
@@ -307,7 +225,8 @@ enum {

static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
{
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;
u64 thislat;

/*
@@ -351,7 +270,7 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
return LAT_EXCEEDED;
}

- if (rwb->scale_step)
+ if (rqd->scale_step)
trace_wbt_stat(bdi, stat);

return LAT_OK;
@@ -359,58 +278,48 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)

static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
{
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;

- trace_wbt_step(bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
- rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+ trace_wbt_step(bdi, msg, rqd->scale_step, rwb->cur_win_nsec,
+ rwb->wb_background, rwb->wb_normal, rqd->max_depth);
}

-static void scale_up(struct rq_wb *rwb)
+static void calc_wb_limits(struct rq_wb *rwb)
{
- /*
- * Hit max in previous round, stop here
- */
- if (rwb->scaled_max)
- return;
+ if (rwb->min_lat_nsec == 0) {
+ rwb->wb_normal = rwb->wb_background = 0;
+ } else if (rwb->rq_depth.max_depth <= 2) {
+ rwb->wb_normal = rwb->rq_depth.max_depth;
+ rwb->wb_background = 1;
+ } else {
+ rwb->wb_normal = (rwb->rq_depth.max_depth + 1) / 2;
+ rwb->wb_background = (rwb->rq_depth.max_depth + 3) / 4;
+ }
+}

- rwb->scale_step--;
+static void scale_up(struct rq_wb *rwb)
+{
+ rq_depth_scale_up(&rwb->rq_depth);
+ calc_wb_limits(rwb);
rwb->unknown_cnt = 0;
-
- rwb->scaled_max = calc_wb_limits(rwb);
-
- rwb_wake_all(rwb);
-
- rwb_trace_step(rwb, "step up");
+ rwb_trace_step(rwb, "scale up");
}

-/*
- * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
- * had a latency violation.
- */
static void scale_down(struct rq_wb *rwb, bool hard_throttle)
{
- /*
- * Stop scaling down when we've hit the limit. This also prevents
- * ->scale_step from going to crazy values, if the device can't
- * keep up.
- */
- if (rwb->wb_max == 1)
- return;
-
- if (rwb->scale_step < 0 && hard_throttle)
- rwb->scale_step = 0;
- else
- rwb->scale_step++;
-
- rwb->scaled_max = false;
- rwb->unknown_cnt = 0;
+ rq_depth_scale_down(&rwb->rq_depth, hard_throttle);
calc_wb_limits(rwb);
- rwb_trace_step(rwb, "step down");
+ rwb->unknown_cnt = 0;
+ rwb_wake_all(rwb);
+ rwb_trace_step(rwb, "scale down");
}

static void rwb_arm_timer(struct rq_wb *rwb)
{
- if (rwb->scale_step > 0) {
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ if (rqd->scale_step > 0) {
/*
* We should speed this up, using some variant of a fast
* integer inverse square root calculation. Since we only do
@@ -418,7 +327,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
* though.
*/
rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
- int_sqrt((rwb->scale_step + 1) << 8));
+ int_sqrt((rqd->scale_step + 1) << 8));
} else {
/*
* For step < 0, we don't want to increase/decrease the
@@ -433,12 +342,13 @@ static void rwb_arm_timer(struct rq_wb *rwb)
static void wb_timer_fn(struct blk_stat_callback *cb)
{
struct rq_wb *rwb = cb->data;
+ struct rq_depth *rqd = &rwb->rq_depth;
unsigned int inflight = wbt_inflight(rwb);
int status;

status = latency_exceeded(rwb, cb->stat);

- trace_wbt_timer(rwb->queue->backing_dev_info, status, rwb->scale_step,
+ trace_wbt_timer(rwb->rqos.q->backing_dev_info, status, rqd->scale_step,
inflight);

/*
@@ -469,9 +379,9 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
* currently don't have a valid read/write sample. For that
* case, slowly return to center state (step == 0).
*/
- if (rwb->scale_step > 0)
+ if (rqd->scale_step > 0)
scale_up(rwb);
- else if (rwb->scale_step < 0)
+ else if (rqd->scale_step < 0)
scale_down(rwb, false);
break;
default:
@@ -481,19 +391,50 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
/*
* Re-arm timer, if we have IO in flight
*/
- if (rwb->scale_step || inflight)
+ if (rqd->scale_step || inflight)
rwb_arm_timer(rwb);
}

-void wbt_update_limits(struct rq_wb *rwb)
+static void __wbt_update_limits(struct rq_wb *rwb)
{
- rwb->scale_step = 0;
- rwb->scaled_max = false;
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ rqd->scale_step = 0;
+ rqd->scaled_max = false;
+
+ rq_depth_calc_max_depth(rqd);
calc_wb_limits(rwb);

rwb_wake_all(rwb);
}

+void wbt_update_limits(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+u64 wbt_get_min_lat(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return 0;
+ return RQWB(rqos)->min_lat_nsec;
+}
+
+void wbt_set_min_lat(struct request_queue *q, u64 val)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ RQWB(rqos)->min_lat_nsec = val;
+ RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+
static bool close_io(struct rq_wb *rwb)
{
const unsigned long now = jiffies;
@@ -520,7 +461,7 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
* IO for a bit.
*/
if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
- limit = rwb->wb_max;
+ limit = rwb->rq_depth.max_depth;
else if ((rw & REQ_BACKGROUND) || close_io(rwb)) {
/*
* If less than 100ms since we completed unrelated IO,
@@ -554,7 +495,7 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
rqw->wait.head.next != &wait->entry)
return false;

- return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
+ return rq_wait_inc_below(rqw, get_limit(rwb, rw));
}

/*
@@ -614,8 +555,10 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
* in an irq held spinlock, if it holds one when calling this function.
* If we do sleep, we'll release and re-grab it.
*/
-enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+static enum wbt_flags wbt_wait(struct rq_qos *rqos, struct bio *bio,
+ spinlock_t *lock)
{
+ struct rq_wb *rwb = RQWB(rqos);
enum wbt_flags ret = 0;

if (!rwb_enabled(rwb))
@@ -643,8 +586,10 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
return ret | WBT_TRACKED;
}

-void wbt_issue(struct rq_wb *rwb, struct request *rq)
+void wbt_issue(struct rq_qos *rqos, struct request *rq)
{
+ struct rq_wb *rwb = RQWB(rqos);
+
if (!rwb_enabled(rwb))
return;

@@ -661,8 +606,9 @@ void wbt_issue(struct rq_wb *rwb, struct request *rq)
}
}

-void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+void wbt_requeue(struct rq_qos *rqos, struct request *rq)
{
+ struct rq_wb *rwb = RQWB(rqos);
if (!rwb_enabled(rwb))
return;
if (rq == rwb->sync_cookie) {
@@ -671,39 +617,30 @@ void wbt_requeue(struct rq_wb *rwb, struct request *rq)
}
}

-void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
{
- if (rwb) {
- rwb->queue_depth = depth;
- wbt_update_limits(rwb);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos) {
+ RQWB(rqos)->rq_depth.queue_depth = depth;
+ __wbt_update_limits(RQWB(rqos));
}
}

-void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
-{
- if (rwb)
- rwb->wc = write_cache_on;
-}
-
-/*
- * Disable wbt, if enabled by default.
- */
-void wbt_disable_default(struct request_queue *q)
+void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
{
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT)
- wbt_exit(q);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos)
+ RQWB(rqos)->wc = write_cache_on;
}
-EXPORT_SYMBOL_GPL(wbt_disable_default);

/*
* Enable wbt if defaults are configured that way
*/
void wbt_enable_default(struct request_queue *q)
{
+ struct rq_qos *rqos = wbt_rq_qos(q);
/* Throttling already enabled? */
- if (q->rq_wb)
+ if (rqos)
return;

/* Queue not registered? Maybe shutting down... */
@@ -741,6 +678,41 @@ static int wbt_data_dir(const struct request *rq)
return -1;
}

+static void wbt_exit(struct rq_qos *rqos)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ struct request_queue *q = rqos->q;
+
+ blk_stat_remove_callback(q, rwb->cb);
+ blk_stat_free_callback(rwb->cb);
+ kfree(rwb);
+}
+
+/*
+ * Disable wbt, if enabled by default.
+ */
+void wbt_disable_default(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ struct rq_wb *rwb;
+ if (!rqos)
+ return;
+ rwb = RQWB(rqos);
+ if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
+ rwb->wb_normal = 0;
+}
+EXPORT_SYMBOL_GPL(wbt_disable_default);
+
+
+static struct rq_qos_ops wbt_rqos_ops = {
+ .throttle = wbt_wait,
+ .issue = wbt_issue,
+ .requeue = wbt_requeue,
+ .done = wbt_done,
+ .cleanup = __wbt_done,
+ .exit = wbt_exit,
+};
+
int wbt_init(struct request_queue *q)
{
struct rq_wb *rwb;
@@ -756,39 +728,29 @@ int wbt_init(struct request_queue *q)
return -ENOMEM;
}

- for (i = 0; i < WBT_NUM_RWQ; i++) {
- atomic_set(&rwb->rq_wait[i].inflight, 0);
- init_waitqueue_head(&rwb->rq_wait[i].wait);
- }
+ for (i = 0; i < WBT_NUM_RWQ; i++)
+ rq_wait_init(&rwb->rq_wait[i]);

+ rwb->rqos.id = RQ_QOS_WBT;
+ rwb->rqos.ops = &wbt_rqos_ops;
+ rwb->rqos.q = q;
rwb->last_comp = rwb->last_issue = jiffies;
- rwb->queue = q;
rwb->win_nsec = RWB_WINDOW_NSEC;
rwb->enable_state = WBT_STATE_ON_DEFAULT;
- wbt_update_limits(rwb);
+ rwb->wc = 1;
+ rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
+ __wbt_update_limits(rwb);

/*
* Assign rwb and add the stats callback.
*/
- q->rq_wb = rwb;
+ rq_qos_add(q, &rwb->rqos);
blk_stat_add_callback(q, rwb->cb);

rwb->min_lat_nsec = wbt_default_latency_nsec(q);

- wbt_set_queue_depth(rwb, blk_queue_depth(q));
- wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_queue_depth(q, blk_queue_depth(q));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));

return 0;
}
-
-void wbt_exit(struct request_queue *q)
-{
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb) {
- blk_stat_remove_callback(q, rwb->cb);
- blk_stat_free_callback(rwb->cb);
- q->rq_wb = NULL;
- kfree(rwb);
- }
-}
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 300df531d0a6..53b20a58c0a2 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -9,6 +9,7 @@
#include <linux/ktime.h>

#include "blk-stat.h"
+#include "blk-rq-qos.h"

enum wbt_flags {
WBT_TRACKED = 1, /* write, tracked for throttling */
@@ -35,20 +36,12 @@ enum {
WBT_STATE_ON_MANUAL = 2,
};

-struct rq_wait {
- wait_queue_head_t wait;
- atomic_t inflight;
-};
-
struct rq_wb {
/*
* Settings that govern how we throttle
*/
unsigned int wb_background; /* background writeback */
unsigned int wb_normal; /* normal writeback */
- unsigned int wb_max; /* max throughput writeback */
- int scale_step;
- bool scaled_max;

short enable_state; /* WBT_STATE_* */

@@ -67,15 +60,20 @@ struct rq_wb {
void *sync_cookie;

unsigned int wc;
- unsigned int queue_depth;

unsigned long last_issue; /* last non-throttled issue */
unsigned long last_comp; /* last non-throttled comp */
unsigned long min_lat_nsec;
- struct request_queue *queue;
+ struct rq_qos rqos;
struct rq_wait rq_wait[WBT_NUM_RWQ];
+ struct rq_depth rq_depth;
};

+static inline struct rq_wb *RQWB(struct rq_qos *rqos)
+{
+ return container_of(rqos, struct rq_wb, rqos);
+}
+
static inline unsigned int wbt_inflight(struct rq_wb *rwb)
{
unsigned int i, ret = 0;
@@ -86,6 +84,7 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
return ret;
}

+
#ifdef CONFIG_BLK_WBT

static inline void wbt_track(struct request *rq, enum wbt_flags flags)
@@ -93,19 +92,16 @@ static inline void wbt_track(struct request *rq, enum wbt_flags flags)
rq->wbt_flags |= flags;
}

-void __wbt_done(struct rq_wb *, enum wbt_flags);
-void wbt_done(struct rq_wb *, struct request *);
-enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
int wbt_init(struct request_queue *);
-void wbt_exit(struct request_queue *);
-void wbt_update_limits(struct rq_wb *);
-void wbt_requeue(struct rq_wb *, struct request *);
-void wbt_issue(struct rq_wb *, struct request *);
+void wbt_update_limits(struct request_queue *);
void wbt_disable_default(struct request_queue *);
void wbt_enable_default(struct request_queue *);

-void wbt_set_queue_depth(struct rq_wb *, unsigned int);
-void wbt_set_write_cache(struct rq_wb *, bool);
+u64 wbt_get_min_lat(struct request_queue *q);
+void wbt_set_min_lat(struct request_queue *q, u64 val);
+
+void wbt_set_queue_depth(struct request_queue *, unsigned int);
+void wbt_set_write_cache(struct request_queue *, bool);

u64 wbt_default_latency_nsec(struct request_queue *);

@@ -114,43 +110,30 @@ u64 wbt_default_latency_nsec(struct request_queue *);
static inline void wbt_track(struct request *rq, enum wbt_flags flags)
{
}
-static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
-{
-}
-static inline void wbt_done(struct rq_wb *rwb, struct request *rq)
-{
-}
-static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
- spinlock_t *lock)
-{
- return 0;
-}
static inline int wbt_init(struct request_queue *q)
{
return -EINVAL;
}
-static inline void wbt_exit(struct request_queue *q)
-{
-}
-static inline void wbt_update_limits(struct rq_wb *rwb)
+static inline void wbt_update_limits(struct request_queue *q)
{
}
-static inline void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_disable_default(struct request_queue *q)
{
}
-static inline void wbt_issue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_enable_default(struct request_queue *q)
{
}
-static inline void wbt_disable_default(struct request_queue *q)
+static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
{
}
-static inline void wbt_enable_default(struct request_queue *q)
+static inline void wbt_set_write_cache(struct request_queue *q, bool wc)
{
}
-static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+static inline u64 wbt_get_min_lat(struct request_queue *q)
{
+ return 0;
}
-static inline void wbt_set_write_cache(struct rq_wb *rwb, bool wc)
+static inline void wbt_set_min_lat(struct request_queue *q, u64 val)
{
}
static inline u64 wbt_default_latency_nsec(struct request_queue *q)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..30c60d049167 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -42,7 +42,7 @@ struct bsg_job;
struct blkcg_gq;
struct blk_flush_queue;
struct pr_ops;
-struct rq_wb;
+struct rq_qos;
struct blk_queue_stats;
struct blk_stat_callback;

@@ -455,7 +455,7 @@ struct request_queue {
atomic_t shared_hctx_restart;

struct blk_queue_stats *stats;
- struct rq_wb *rq_wb;
+ struct rq_qos *rq_qos;

/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
--
2.14.3


2018-06-05 13:34:07

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 04/13] blk: introduce REQ_SWAP

From: Josef Bacik <[email protected]>

Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.

Signed-off-by: Josef Bacik <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
include/linux/blk-cgroup.h | 2 +-
include/linux/blk_types.h | 3 ++-
mm/page_io.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index b41292726c0f..a8f9ba8f33a4 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -254,7 +254,7 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
*/
static inline bool bio_issue_as_root_blkg(struct bio *bio)
{
- return (bio->bi_opf & REQ_META);
+ return (bio->bi_opf & (REQ_META | REQ_SWAP)) != 0;
}

/**
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff181df8c195..4f7ba397d37d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -327,7 +327,7 @@ enum req_flag_bits {

/* for driver use */
__REQ_DRV,
-
+ __REQ_SWAP, /* swapping request. */
__REQ_NR_BITS, /* stops here */
};

@@ -349,6 +349,7 @@ enum req_flag_bits {
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)

#define REQ_DRV (1ULL << __REQ_DRV)
+#define REQ_SWAP (1ULL << __REQ_SWAP)

#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/mm/page_io.c b/mm/page_io.c
index b41cf9644585..a552cb37e220 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -338,7 +338,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
ret = -ENOMEM;
goto out;
}
- bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+ bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
count_swpout_vm_event(page);
set_page_writeback(page);
unlock_page(page);
--
2.14.3


2018-06-05 13:34:45

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 07/13] memcontrol: schedule throttling if we are congested

From: Tejun Heo <[email protected]>

Memory allocations can induce swapping via kswapd or direct reclaim. If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling. So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 13 +++++++++++++
include/linux/swap.h | 8 ++++++++
mm/huge_memory.c | 6 +++---
mm/memcontrol.c | 13 +++++++++++++
mm/memory.c | 11 ++++++-----
mm/shmem.c | 10 +++++-----
mm/swapfile.c | 24 ++++++++++++++++++++++++
7 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..4d2e7f35f2dc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -290,6 +290,9 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp,
bool compound);
+int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound);
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare, bool compound);
void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
@@ -745,6 +748,16 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
return 0;
}

+static inline int mem_cgroup_try_charge_delay(struct page *page,
+ struct mm_struct *mm,
+ gfp_t gfp_mask,
+ struct mem_cgroup **memcgp,
+ bool compound)
+{
+ *memcgp = NULL;
+ return 0;
+}
+
static inline void mem_cgroup_commit_charge(struct page *page,
struct mem_cgroup *memcg,
bool lrucare, bool compound)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..7ba0f52496e0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -630,11 +630,19 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+extern void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
+ gfp_t gfp_mask);
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
return vm_swappiness;
}
+
+static inline void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg,
+ int node, gfp_t gfP_maks)
+{
+}
+
#endif

#ifdef CONFIG_MEMCG_SWAP
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..9812ddad9961 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,7 +555,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
@@ -1145,7 +1145,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma,
vmf->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge(pages[i], vma->vm_mm,
+ mem_cgroup_try_charge_delay(pages[i], vma->vm_mm,
GFP_KERNEL, &memcg, false))) {
if (pages[i])
put_page(pages[i]);
@@ -1315,7 +1315,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+ if (unlikely(mem_cgroup_try_charge_delay(new_page, vma->vm_mm,
huge_gfp, &memcg, true))) {
put_page(new_page);
split_huge_pmd(vma, vmf->pmd, vmf->address);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..5fffd28477c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5458,6 +5458,19 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
return ret;
}

+int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
+ memcg = *memcgp;
+ mem_cgroup_throttle_swaprate(memcg, page_to_nid(page), gfp_mask);
+ return ret;
+}
+
/**
* mem_cgroup_commit_charge - commit a page charge
* @page: page to charge
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464e0fd2..d0eea6d33b18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2494,7 +2494,7 @@ static int wp_page_copy(struct vm_fault *vmf)
cow_user_page(new_page, old_page, vmf->address, vma);
}

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
goto oom_free_new;

__SetPageUptodate(new_page);
@@ -2994,8 +2994,8 @@ int do_swap_page(struct vm_fault *vmf)
goto out_page;
}

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
- &memcg, false)) {
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -3156,7 +3156,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
if (!page)
goto oom;

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg,
+ false))
goto oom_free_page;

/*
@@ -3652,7 +3653,7 @@ static int do_cow_fault(struct vm_fault *vmf)
if (!vmf->cow_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
+ if (mem_cgroup_try_charge_delay(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
&vmf->memcg, false)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
diff --git a/mm/shmem.c b/mm/shmem.c
index 9d6c7e595415..a96af5690864 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1219,8 +1219,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
- false);
+ error = mem_cgroup_try_charge_delay(page, current->mm, GFP_KERNEL,
+ &memcg, false);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1697,7 +1697,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
goto failed;
}

- error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
+ error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
@@ -1803,7 +1803,7 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, inode,
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
+ error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
PageTransHuge(page));
if (error)
goto unacct;
@@ -2276,7 +2276,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
__SetPageSwapBacked(page);
__SetPageUptodate(page);

- ret = mem_cgroup_try_charge(page, dst_mm, gfp, &memcg, false);
+ ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg, false);
if (ret)
goto out_release;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cc2cf04d9018..a42f86bc55ad 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3725,6 +3725,30 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
}
}

+#ifdef CONFIG_MEMCG
+void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
+ gfp_t gfp_mask)
+{
+ struct swap_info_struct *si, *next;
+ if (!(gfp_mask & __GFP_IO) || !memcg)
+ return;
+
+ if (atomic_read(&memcg->css.cgroup->congestion_count) == 0)
+ return;
+
+ spin_lock(&swap_avail_lock);
+ plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
+ avail_lists[node]) {
+ if (si->bdev) {
+ blkcg_schedule_throttle(bdev_get_queue(si->bdev),
+ true);
+ break;
+ }
+ }
+ spin_unlock(&swap_avail_lock);
+}
+#endif
+
static int __init swapfile_init(void)
{
int nid;
--
2.14.3


2018-06-05 13:34:52

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 03/13] blk-cgroup: allow controllers to output their own stats

From: Josef Bacik <[email protected]>

blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.

Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.

Signed-off-by: Josef Bacik <[email protected]>
---
block/blk-cgroup.c | 61 +++++++++++++++++++++++++++++++++++++++++++---
include/linux/blk-cgroup.h | 3 +++
2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index eb85cb87c40f..b2ac6a7f851f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -50,6 +50,8 @@ static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];

static LIST_HEAD(all_blkcgs); /* protected by blkcg_pol_mutex */

+static bool blkcg_debug_stats = false;
+
static bool blkcg_policy_enabled(struct request_queue *q,
const struct blkcg_policy *pol)
{
@@ -954,13 +956,28 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)

hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
const char *dname;
+ char *buf;
struct blkg_rwstat rwstat;
u64 rbytes, wbytes, rios, wios;
+ size_t size = seq_get_buf(sf, &buf), count = 0, total = 0;
+ int i;

dname = blkg_dev_name(blkg);
if (!dname)
continue;

+ /*
+ * Hooray string manipulation, count is the size written NOT
+ * INCLUDING THE \0, so size is now count+1 less than what we
+ * had before, but we want to start writing the next bit from
+ * the \0 so we only add count to buf.
+ */
+ count = snprintf(buf, size, "%s ", dname);
+ if (count >= size)
+ continue;
+ buf += count;
+ size -= count + 1;
+
spin_lock_irq(blkg->q->queue_lock);

rwstat = blkg_rwstat_recursive_sum(blkg, NULL,
@@ -975,9 +992,44 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)

spin_unlock_irq(blkg->q->queue_lock);

- if (rbytes || wbytes || rios || wios)
- seq_printf(sf, "%s rbytes=%llu wbytes=%llu rios=%llu wios=%llu\n",
- dname, rbytes, wbytes, rios, wios);
+ if (rbytes || wbytes || rios || wios) {
+ total += count;
+ count = snprintf(buf, size,
+ "rbytes=%llu wbytes=%llu rios=%llu wios=%llu",
+ rbytes, wbytes, rios, wios);
+ if (count >= size)
+ continue;
+ buf += count;
+ total += count;
+ size -= count + 1;
+ }
+
+ if (!blkcg_debug_stats)
+ goto next;
+
+ mutex_lock(&blkcg_pol_mutex);
+ for (i = 0; i < BLKCG_MAX_POLS; i++) {
+ struct blkcg_policy *pol = blkcg_policy[i];
+
+ if (!blkg->pd[i] || !pol->pd_stat_fn)
+ continue;
+
+ count = pol->pd_stat_fn(blkg->pd[i], buf, size);
+ if (count >= size)
+ continue;
+ buf += count;
+ total += count;
+ size -= count + 1;
+ }
+ mutex_unlock(&blkcg_pol_mutex);
+next:
+ if (total) {
+ count = snprintf(buf, size, "\n");
+ if (count >= size)
+ continue;
+ total += count;
+ seq_commit(sf, total);
+ }
}

rcu_read_unlock();
@@ -1547,3 +1599,6 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
mutex_unlock(&blkcg_pol_register_mutex);
}
EXPORT_SYMBOL_GPL(blkcg_policy_unregister);
+
+module_param(blkcg_debug_stats, bool, 0444);
+MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 69aa71dc0c04..b41292726c0f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -148,6 +148,8 @@ typedef void (blkcg_pol_online_pd_fn)(struct blkg_policy_data *pd);
typedef void (blkcg_pol_offline_pd_fn)(struct blkg_policy_data *pd);
typedef void (blkcg_pol_free_pd_fn)(struct blkg_policy_data *pd);
typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkg_policy_data *pd);
+typedef size_t (blkcg_pol_stat_pd_fn)(struct blkg_policy_data *pd, char *buf,
+ size_t size);

struct blkcg_policy {
int plid;
@@ -167,6 +169,7 @@ struct blkcg_policy {
blkcg_pol_offline_pd_fn *pd_offline_fn;
blkcg_pol_free_pd_fn *pd_free_fn;
blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
+ blkcg_pol_stat_pd_fn *pd_stat_fn;
};

extern struct blkcg blkcg_root;
--
2.14.3


2018-06-05 13:35:14

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 02/13] block: introduce bio_issue_as_root_blkg

From: Josef Bacik <[email protected]>

Instead of forcing all file systems to get the right context on their
bio's, simply check for REQ_META to see if we need to issue as the root
blkg. We don't want to force all bio's to have the root blkg associated
with them if REQ_META is set, as some controllers (blk-iolatency) need
to know who the originating cgroup is so it can backcharge them for the
work they are doing. This helper will make sure that the controllers do
the proper thing wrt the IO priority and backcharging.

Signed-off-by: Josef Bacik <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
include/linux/blk-cgroup.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..69aa71dc0c04 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -238,6 +238,22 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return css_to_blkcg(task_css(current, io_cgrp_id));
}

+/**
+ * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg
+ * @return: true if this bio needs to be submitted with the root blkg context.
+ *
+ * In order to avoid priority inversions we sometimes need to issue a bio as if
+ * it were attached to the root blkg, and then backcharge to the actual owning
+ * blkg. The idea is we do bio_blkcg() to look up the actual context for the
+ * bio and attach the appropriate blkg to the bio. Then we call this helper and
+ * if it is true run with the root blkg for that queue and then do any
+ * backcharging to the originating cgroup once the io is complete.
+ */
+static inline bool bio_issue_as_root_blkg(struct bio *bio)
+{
+ return (bio->bi_opf & REQ_META);
+}
+
/**
* blkcg_parent - get the parent of a blkcg
* @blkcg: blkcg of interest
--
2.14.3


2018-06-05 13:36:31

by Josef Bacik

[permalink] [raw]
Subject: [PATCH 01/13] block: add bi_blkg to the bio for cgroups

From: Josef Bacik <[email protected]>

Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well. Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.

Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio. This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.

So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.

Signed-off-by: Josef Bacik <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/bio.c | 23 +++++++++++++++++++++++
block/blk-throttle.c | 21 +++++++--------------
include/linux/bio.h | 1 +
include/linux/blk_types.h | 2 +-
4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0a4df92cd689..57c4b1986e76 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -28,6 +28,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/cgroup.h>
+#include <linux/blk-cgroup.h>

#include <trace/events/block.h>
#include "blk.h"
@@ -2026,6 +2027,24 @@ int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
}
EXPORT_SYMBOL_GPL(bio_associate_blkcg);

+/**
+ * bio_associate_blkg - associate a bio with the specified blkg
+ * @bio: target bio
+ * @blkg: the blkg to associate
+ *
+ * Associate @bio with the blkg specified by @blkg. This is the queue specific
+ * blkcg information associated with the @bio, a reference will be taken on the
+ * @blkg and will be freed when the bio is freed.
+ */
+int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
+{
+ if (unlikely(bio->bi_blkg))
+ return -EBUSY;
+ blkg_get(blkg);
+ bio->bi_blkg = blkg;
+ return 0;
+}
+
/**
* bio_disassociate_task - undo bio_associate_current()
* @bio: target bio
@@ -2040,6 +2059,10 @@ void bio_disassociate_task(struct bio *bio)
css_put(bio->bi_css);
bio->bi_css = NULL;
}
+ if (bio->bi_blkg) {
+ blkg_put(bio->bi_blkg);
+ bio->bi_blkg = NULL;
+ }
}

/**
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f63d88c92c3a..5112cef3166b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2131,12 +2131,8 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
{
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
- if (bio->bi_css) {
- if (bio->bi_cg_private)
- blkg_put(tg_to_blkg(bio->bi_cg_private));
- bio->bi_cg_private = tg;
- blkg_get(tg_to_blkg(tg));
- }
+ if (bio->bi_css)
+ bio_associate_blkg(bio, tg_to_blkg(tg));
bio_issue_init(&bio->bi_issue, bio_sectors(bio));
#endif
}
@@ -2284,6 +2280,7 @@ void blk_throtl_stat_add(struct request *rq, u64 time_ns)

void blk_throtl_bio_endio(struct bio *bio)
{
+ struct blkcg_gq *blkg;
struct throtl_grp *tg;
u64 finish_time_ns;
unsigned long finish_time;
@@ -2291,20 +2288,18 @@ void blk_throtl_bio_endio(struct bio *bio)
unsigned long lat;
int rw = bio_data_dir(bio);

- tg = bio->bi_cg_private;
- if (!tg)
+ blkg = bio->bi_blkg;
+ if (!blkg)
return;
- bio->bi_cg_private = NULL;
+ tg = blkg_to_tg(blkg);

finish_time_ns = ktime_get_ns();
tg->last_finish_time = finish_time_ns >> 10;

start_time = bio_issue_time(&bio->bi_issue) >> 10;
finish_time = __bio_issue_time(finish_time_ns) >> 10;
- if (!start_time || finish_time <= start_time) {
- blkg_put(tg_to_blkg(tg));
+ if (!start_time || finish_time <= start_time)
return;
- }

lat = finish_time - start_time;
/* this is only for bio based driver */
@@ -2333,8 +2328,6 @@ void blk_throtl_bio_endio(struct bio *bio)
tg->bio_cnt /= 2;
tg->bad_bio_cnt /= 2;
}
-
- blkg_put(tg_to_blkg(tg));
}
#endif

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 98b175cc00d5..f2f3f1428e81 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -547,6 +547,7 @@ do { \

#ifdef CONFIG_BLK_CGROUP
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
+int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg);
void bio_disassociate_task(struct bio *bio);
void bio_clone_blkcg_association(struct bio *dst, struct bio *src);
#else /* CONFIG_BLK_CGROUP */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4cb970cdcd11..ff181df8c195 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -179,8 +179,8 @@ struct bio {
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
+ struct blkcg_gq *bi_blkg;
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
- void *bi_cg_private;
struct bio_issue bi_issue;
#endif
#endif
--
2.14.3


2018-06-05 19:42:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 05/13] swap,blkcg: issue swap io with the appropriate context

On Tue, Jun 05, 2018 at 09:29:40AM -0400, Josef Bacik wrote:
> From: Tejun Heo <[email protected]>
>
> For backcharging we need to know who the page belongs to when swapping
> it out.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Signed-off-by: Josef Bacik <[email protected]>

Acked-by: Tejun Heo <[email protected]>

I still think it might be worthwhile to mention that we're skipping
over the rw_page cases.

Thanks.

--
tejun

2018-06-05 20:47:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/13] blkcg: add generic throttling mechanism

On Tue, Jun 05, 2018 at 09:29:41AM -0400, Josef Bacik wrote:
> From: Josef Bacik <[email protected]>
>
> Since IO can be issued from literally anywhere it's almost impossible to
> do throttling without having some sort of adverse effect somewhere else
> in the system because of locking or other dependencies. The best way to
> solve this is to do the throttling when we know we aren't holding any
> other kernel resources. Do this by tracking throttling in a per-blkg
> basis, and if we require throttling flag the task that it needs to check
> before it returns to user space and possibly sleep there.
>
> This is to address the case where a process is doing work that is
> generating IO that can't be throttled, whether that is directly with a
> lot of REQ_META IO, or indirectly by allocating so much memory that it
> is swamping the disk with REQ_SWAP. We can't use task_add_work as we
> don't want to induce a memory allocation in the IO path, so simply
> saving the request queue in the task and flagging it to do the
> notify_resume thing achieves the same result without the overhead of a
> memory allocation.
>
> Signed-off-by: Josef Bacik <[email protected]>

Acked-by: Tejun Heo <[email protected]>

--
tejun

2018-06-05 20:47:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 07/13] memcontrol: schedule throttling if we are congested

On Tue, Jun 05, 2018 at 09:29:42AM -0400, Josef Bacik wrote:
> From: Tejun Heo <[email protected]>
>
> Memory allocations can induce swapping via kswapd or direct reclaim. If
> we are having IO done for us by kswapd and don't actually go into direct
> reclaim we may never get scheduled for throttling. So instead check to
> see if our cgroup is congested, and if so schedule the throttling.
> Before we return to user space the throttling stuff will only throttle
> if we actually required it.
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2018-06-11 14:05:20

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/13] swap,blkcg: issue swap io with the appropriate context

On Tue, Jun 05, 2018 at 09:29:40AM -0400, Josef Bacik wrote:
> From: Tejun Heo <[email protected]>
>
> For backcharging we need to know who the page belongs to when swapping
> it out.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Signed-off-by: Josef Bacik <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2018-06-11 14:32:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 07/13] memcontrol: schedule throttling if we are congested

On Tue, Jun 05, 2018 at 09:29:42AM -0400, Josef Bacik wrote:
> From: Tejun Heo <[email protected]>
>
> Memory allocations can induce swapping via kswapd or direct reclaim. If
> we are having IO done for us by kswapd and don't actually go into direct
> reclaim we may never get scheduled for throttling. So instead check to
> see if our cgroup is congested, and if so schedule the throttling.
> Before we return to user space the throttling stuff will only throttle
> if we actually required it.
>
> Signed-off-by: Tejun Heo <[email protected]>

Looks good to me now, thanks.

Acked-by: Johannes Weiner <[email protected]>