Hi,
Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try to start chrome, it basically won't start
before the buffered writeback is done. The same goes for server-oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.
I have posted plenty of results previously, so I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async to
read a 5g file and rewrite it.
4.6-rc3:
$ t/read-to-pipe-async -f ~/5g > 5g-new
Latency percentiles (usec) (READERS)
50.0000th: 2
75.0000th: 3
90.0000th: 5
95.0000th: 7
99.0000th: 43
99.5000th: 77
99.9000th: 9008
99.9900th: 91008
99.9990th: 286208
99.9999th: 347648
Over=1251, min=0, max=358081
Latency percentiles (usec) (WRITERS)
50.0000th: 4
75.0000th: 8
90.0000th: 13
95.0000th: 15
99.0000th: 32
99.5000th: 43
99.9000th: 81
99.9900th: 2372
99.9990th: 104320
99.9999th: 349696
Over=63, min=1, max=358321
Read rate (KB/sec) : 91859
Write rate (KB/sec): 91859
4.6-rc3 + wb-buf-throttle:
Latency percentiles (usec) (READERS)
50.0000th: 2
75.0000th: 3
90.0000th: 5
95.0000th: 8
99.0000th: 48
99.5000th: 79
99.9000th: 5304
99.9900th: 22496
99.9990th: 29408
99.9999th: 33728
Over=860, min=0, max=37599
Latency percentiles (usec) (WRITERS)
50.0000th: 4
75.0000th: 9
90.0000th: 14
95.0000th: 16
99.0000th: 34
99.5000th: 45
99.9000th: 87
99.9900th: 1342
99.9990th: 13648
99.9999th: 21280
Over=29, min=1, max=30457
Read rate (KB/sec) : 95832
Write rate (KB/sec): 95832
Better throughput and tighter latencies, for both reads and writes.
That's hard not to like.
The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.
This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. It's all about managing
the queues on the hardware side. The big change in this version is that
it should be pretty much auto-tuning - you no longer have to set a
given percentage of writeback bandwidth. I've implemented something
similar to CoDel to manage the writeback queue. See the last patch
for a full description, but the tldr is that we monitor min latencies
over a window of time, and scale up/down the queue based on that. This
needs a minimum of tunables, and it stays out of the way if your device
is fast enough. There's a single tunable now, wb_lat_usec, which simply
sets this latency target. Most people won't have to touch it; it'll
work pretty well just being in the ballpark.
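For reference, the target can be read and changed through sysfs like the
other queue attributes (paths and defaults here are taken from the last
patch; the value is in usec, and 0 disables throttling). For example, on
an nvme device:
$ cat /sys/block/nvme0n1/queue/wb_lat_usec
2000
$ echo 10000 > /sys/block/nvme0n1/queue/wb_lat_usec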
I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed any problems. It passes full
xfstests runs, and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq and "classic" setups.
You can also find this in a branch in the block git repo:
git://git.kernel.dk/linux-block.git wb-buf-throttle
Note that I rebase this branch when I collapse patches. The
wb-buf-throttle-v4 branch will remain the same as this version. I've folded
the device write cache changes into my 4.7 branches, so they are not
a part of this posting. Get the full wb-buf-throttle branch, or apply
the patches here on top of my for-next. A full patch against Linus'
current tree can also be downloaded here:
http://brick.kernel.dk/snaps/wb-buf-throttle-v4.patch
Changes since v3
- Re-do the mm/ writeback parts. Add REQ_BG for background writes,
and don't overload the wbc 'reason' for writeback decisions.
- Add tracking for when apps are sleeping waiting for a page to complete.
- Change wbc_to_write() to wbc_to_write_cmd().
- Use atomic_t for the balance_dirty_pages() sleep count.
- Add a basic scalable block stats tracking framework.
- Rewrite blk-wb core as described above, to dynamically adapt. This is
a big change, see the last patch for a full description of it.
- Add tracing to blk-wb, instead of using debug printk's.
- Rebased to 4.6-rc3 (ish)
Changes since v2
- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change
between wakeups.
- Minor tweaks, fixups, cleanups.
Changes since v1
- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups
Documentation/block/queue-sysfs.txt | 9
Documentation/block/writeback_cache_control.txt | 4
arch/um/drivers/ubd_kern.c | 2
block/Makefile | 2
block/blk-core.c | 22 +
block/blk-flush.c | 11
block/blk-mq-sysfs.c | 47 ++
block/blk-mq.c | 45 ++
block/blk-mq.h | 3
block/blk-settings.c | 58 +-
block/blk-stat.c | 184 ++++++++
block/blk-stat.h | 17
block/blk-sysfs.c | 122 +++++
block/blk-wb.c | 495 ++++++++++++++++++++++++
block/blk-wb.h | 42 ++
drivers/block/drbd/drbd_main.c | 2
drivers/block/loop.c | 2
drivers/block/mtip32xx/mtip32xx.c | 6
drivers/block/nbd.c | 4
drivers/block/osdblk.c | 2
drivers/block/ps3disk.c | 2
drivers/block/skd_main.c | 2
drivers/block/virtio_blk.c | 6
drivers/block/xen-blkback/xenbus.c | 2
drivers/block/xen-blkfront.c | 3
drivers/ide/ide-disk.c | 6
drivers/md/bcache/super.c | 2
drivers/md/dm-table.c | 20
drivers/md/md.c | 2
drivers/md/raid5-cache.c | 3
drivers/mmc/card/block.c | 2
drivers/mtd/mtd_blkdevs.c | 2
drivers/nvme/host/core.c | 7
drivers/scsi/scsi.c | 3
drivers/scsi/sd.c | 8
drivers/target/target_core_iblock.c | 6
fs/block_dev.c | 2
fs/buffer.c | 2
fs/f2fs/data.c | 2
fs/f2fs/node.c | 2
fs/gfs2/meta_io.c | 3
fs/mpage.c | 9
fs/xfs/xfs_aops.c | 2
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 14
include/linux/blkdev.h | 27 +
include/linux/fs.h | 4
include/linux/writeback.h | 10
include/trace/events/block.h | 98 ++++
mm/backing-dev.c | 1
mm/filemap.c | 42 +-
mm/page-writeback.c | 2
52 files changed, 1281 insertions(+), 96 deletions(-)
--
Jens Axboe
Test patch that throttles buffered writeback to make it a lot
smoother, with way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has a heavy impact on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply decrement the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
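As a quick illustration of the scaling math, here's a condensed sketch of
how the depth limits and the monitoring window follow from the scale step
(boiled down from calc_wb_limits() and rwb_arm_timer() in the patch below;
the numbers in the comment assume a device queue depth of 64):
/*
 * Condensed sketch, not the actual patch code. With a queue depth of 64:
 * step 0 -> max=64, normal=32, background=16, window 100 msec
 * step 1 -> max=32, normal=16, background=8,  window ~70 msec
 */
static void wb_scale_sketch(struct rq_wb *rwb, unsigned int queue_depth)
{
	unsigned int depth = min_t(unsigned int, RWB_MAX_DEPTH, queue_depth);

	/* halve the max depth per scale step, derive normal/background */
	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
	rwb->wb_normal = (rwb->wb_max + 1) / 2;
	rwb->wb_background = (rwb->wb_max + 3) / 4;

	/* monitoring window shrinks as 100 msec / sqrt(scale step + 1) */
	rwb->win_nsec = 1000000000ULL / int_sqrt((rwb->scale_step + 1) * 100);
}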
The patch registers two sysfs entries. The first one, 'wb_lat_usec',
sets the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb.
The second entry, 'wb_stats', is a debug entry that simply shows the
current internal state of the throttling machine:
$ cat /sys/block/nvme0n1/queue/wb_stats
background=16, normal=32, max=64, inflight=0, wait=0, bdp_wait=0
'background' denotes how many requests we will allow in-flight for
idle background buffered writeback, 'normal' for higher priority
writeback, and 'max' for when it's urgent we clean pages.
'inflight' shows how many requests are currently in-flight for
buffered writeback, 'wait' shows if anyone is currently waiting for
access, and 'bdp_wait' shows if someone is currently throttled on this
device in balance_dirty_pages().
blk-wb also registers a few trace events that can be used to monitor
the state changes:
block_wb_lat: Latency 2446318
block_wb_stat: read lat: mean=2446318, min=2446318, max=2446318, samples=1,
write lat: mean=518866, min=15522, max=5330353, samples=57
block_wb_step: step down: step=1, background=8, normal=16, max=32
'block_wb_lat' logs a violation in sync issue latency, 'block_wb_stat'
logs a window violation of latencies and dumps the stats that led to
that, and finally, 'block_wb_step' logs a step up/down and the new
limits associated with that state.
Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 15 ++
block/blk-mq.c | 31 ++-
block/blk-settings.c | 4 +
block/blk-sysfs.c | 57 +++++
block/blk-wb.c | 495 +++++++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 42 ++++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 3 +
include/trace/events/block.h | 98 +++++++++
10 files changed, 746 insertions(+), 3 deletions(-)
create mode 100644 block/blk-wb.c
create mode 100644 block/blk-wb.h
diff --git a/block/Makefile b/block/Makefile
index 3446e0472df0..7e4be7a56a59 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
+ blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o blk-wb.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 40b57bf4852c..d941f69dfb4b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
fail:
blk_free_flush_queue(q->fq);
+ blk_wb_exit(q);
return NULL;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
blk_delete_timer(rq);
blk_clear_rq_complete(rq);
trace_block_rq_requeue(q, rq);
+ blk_wb_requeue(q->rq_wb, rq);
if (rq->cmd_flags & REQ_QUEUED)
blk_queue_end_tag(q, rq);
@@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
/* this is a bio leak */
WARN_ON(req->bio != NULL);
+ blk_wb_done(q->rq_wb, req);
+
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
@@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
+ bool wb_acct;
/*
* low level driver can indicate that it wants pages above a
@@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}
get_rq:
+ wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
+
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1789,16 @@ get_rq:
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (IS_ERR(req)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
bio->bi_error = PTR_ERR(req);
bio_endio(bio);
goto out_unlock;
}
+ if (wb_acct)
+ req->cmd_flags |= REQ_BUF_INFLIGHT;
+
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
@@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
blk_dequeue_request(req);
req->issue_time = ktime_to_ns(ktime_get());
+ blk_wb_issue(req->q->rq_wb, req);
/*
* We are now handing the request to the hardware, initialize
@@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
blk_unprep_request(req);
blk_account_io_done(req);
+ blk_wb_done(req->q->rq_wb, req);
if (req->end_io)
req->end_io(req, error);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71b4a13fbf94..c0c5207fe7fd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
#include "blk-mq.h"
#include "blk-mq-tag.h"
#include "blk-stat.h"
+#include "blk-wb.h"
static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -275,6 +276,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
if (rq->cmd_flags & REQ_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
+
+ blk_wb_done(q->rq_wb, rq);
+
rq->cmd_flags = 0;
clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -305,6 +309,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request);
inline void __blk_mq_end_request(struct request *rq, int error)
{
blk_account_io_done(rq);
+ blk_wb_done(rq->q->rq_wb, rq);
if (rq->end_io) {
rq->end_io(rq, error);
@@ -414,6 +419,7 @@ void blk_mq_start_request(struct request *rq)
rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
rq->issue_time = ktime_to_ns(ktime_get());
+ blk_wb_issue(q->rq_wb, rq);
blk_add_timer(rq);
@@ -450,6 +456,7 @@ static void __blk_mq_requeue_request(struct request *rq)
struct request_queue *q = rq->q;
trace_block_rq_requeue(q, rq);
+ blk_wb_requeue(q->rq_wb, rq);
if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1265,6 +1272,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
+ bool wb_acct;
blk_queue_bounce(q, &bio);
@@ -1282,9 +1290,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);
+ wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -1361,6 +1377,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
struct blk_map_ctx data;
struct request *rq;
blk_qc_t cookie;
+ bool wb_acct;
blk_queue_bounce(q, &bio);
@@ -1375,9 +1392,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, NULL))
return BLK_QC_T_NONE;
+ wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -2111,6 +2136,8 @@ void blk_mq_free_queue(struct request_queue *q)
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);
+ blk_wb_exit(q);
+
blk_mq_del_queue_tag_set(q);
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f7e122e717e8..84bcfc22e020 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -13,6 +13,7 @@
#include <linux/gfp.h>
#include "blk.h"
+#include "blk-wb.h"
unsigned long blk_max_low_pfn;
EXPORT_SYMBOL(blk_max_low_pfn);
@@ -840,6 +841,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
q->queue_depth = depth;
+
+ if (q->rq_wb)
+ blk_wb_update_limits(q->rq_wb);
}
EXPORT_SYMBOL(blk_set_queue_depth);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6e516cc0d3d0..13f325deffa1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@
#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"
struct queue_sysfs_entry {
struct attribute attr;
@@ -347,6 +348,47 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}
+static ssize_t queue_wb_stats_show(struct request_queue *q, char *page)
+{
+ struct rq_wb *rwb = q->rq_wb;
+
+ if (!rwb)
+ return -EINVAL;
+
+ return sprintf(page, "background=%d, normal=%d, max=%d, inflight=%d,"
+ " wait=%d, bdp_wait=%d\n", rwb->wb_background,
+ rwb->wb_normal, rwb->wb_max,
+ atomic_read(&rwb->inflight),
+ waitqueue_active(&rwb->wait),
+ atomic_read(rwb->bdp_wait));
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return sprintf(page, "%llu\n", q->rq_wb->min_lat_nsec / 1000ULL);
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ u64 val;
+ int err;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ err = kstrtou64(page, 10, &val);
+ if (err < 0)
+ return err;
+
+ q->rq_wb->min_lat_nsec = val * 1000ULL;
+ blk_wb_update_limits(q->rq_wb);
+ return count;
+}
+
static ssize_t queue_wc_show(struct request_queue *q, char *page)
{
if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -541,6 +583,17 @@ static struct queue_sysfs_entry queue_stats_entry = {
.show = queue_stats_show,
};
+static struct queue_sysfs_entry queue_wb_stats_entry = {
+ .attr = {.name = "wb_stats", .mode = S_IRUGO },
+ .show = queue_wb_stats_show,
+};
+
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+ .attr = {.name = "wb_lat_usec", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_lat_show,
+ .store = queue_wb_lat_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -568,6 +621,8 @@ static struct attribute *default_attrs[] = {
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_stats_entry.attr,
+ &queue_wb_stats_entry.attr,
+ &queue_wb_lat_entry.attr,
NULL,
};
@@ -721,6 +776,8 @@ int blk_register_queue(struct gendisk *disk)
if (q->mq_ops)
blk_mq_register_disk(disk);
+ blk_wb_init(q);
+
if (!q->request_fn)
return 0;
diff --git a/block/blk-wb.c b/block/blk-wb.c
new file mode 100644
index 000000000000..1b1d80876930
--- /dev/null
+++ b/block/blk-wb.c
@@ -0,0 +1,495 @@
+/*
+ * buffered writeback throttling. loosely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ * scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 msec / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ * look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that (may) need changing:
+ *
+ * - Different scaling of background/normal/high priority writeback.
+ * We may have to violate guarantees for max.
+ * - We can have mismatches between the stat window and our window.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <trace/events/block.h>
+
+#include "blk.h"
+#include "blk-wb.h"
+#include "blk-stat.h"
+
+enum {
+ /*
+ * Might need to be higher
+ */
+ RWB_MAX_DEPTH = 64,
+
+ /*
+ * 100msec window
+ */
+ RWB_WINDOW_NSEC = 100 * 1000 * 1000ULL,
+
+ /*
+ * Disregard stats, if we don't meet these minimums
+ */
+ RWB_MIN_WRITE_SAMPLES = 3,
+ RWB_MIN_READ_SAMPLES = 1,
+
+ /*
+ * Target min latencies, in nsecs
+ */
+ RWB_ROT_LAT = 75000000ULL, /* 75 msec */
+ RWB_NONROT_LAT = 2000000ULL, /* 2 msec */
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+ return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+ if (rwb_enabled(rwb)) {
+ const unsigned long cur = jiffies;
+
+ if (cur != *var)
+ *var = cur;
+ }
+}
+
+void __blk_wb_done(struct rq_wb *rwb)
+{
+ int inflight, limit = rwb->wb_normal;
+
+ /*
+ * If the device does write back caching, drop further down
+ * before we wake people up.
+ */
+ if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) &&
+ !atomic_read(rwb->bdp_wait))
+ limit = 0;
+ else
+ limit = rwb->wb_normal;
+
+ /*
+ * Don't wake anyone up if we are above the normal limit. If
+ * throttling got disabled (limit == 0) with waiters, ensure
+ * that we wake them up.
+ */
+ inflight = atomic_dec_return(&rwb->inflight);
+ if (limit && inflight >= limit) {
+ if (!rwb->wb_max)
+ wake_up_all(&rwb->wait);
+ return;
+ }
+
+ if (waitqueue_active(&rwb->wait)) {
+ int diff = limit - inflight;
+
+ if (!inflight || diff >= rwb->wb_background / 2)
+ wake_up_nr(&rwb->wait, 1);
+ }
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void blk_wb_done(struct rq_wb *rwb, struct request *rq)
+{
+ if (!rwb)
+ return;
+
+ if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) {
+ if (rwb->sync_cookie == rq) {
+ rwb->sync_issue = 0;
+ rwb->sync_cookie = NULL;
+ }
+
+ wb_timestamp(rwb, &rwb->last_comp);
+ } else {
+ WARN_ON_ONCE(rq == rwb->sync_cookie);
+ __blk_wb_done(rwb);
+ rq->cmd_flags &= ~REQ_BUF_INFLIGHT;
+ }
+}
+
+static void calc_wb_limits(struct rq_wb *rwb)
+{
+ unsigned int depth;
+
+ if (!rwb->min_lat_nsec) {
+ rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
+ return;
+ }
+
+ depth = min_t(unsigned int, RWB_MAX_DEPTH, blk_queue_depth(rwb->q));
+
+ /*
+ * Reduce max depth by 50%, and re-calculate normal/bg based on that
+ */
+ rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+ rwb->wb_normal = (rwb->wb_max + 1) / 2;
+ rwb->wb_background = (rwb->wb_max + 3) / 4;
+}
+
+static bool inline stat_sample_valid(struct blk_rq_stat *stat)
+{
+ /*
+ * We need at least one read sample, and a minimum of
+ * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
+ * that it's writes impacting us, and not just some sole read on
+ * a device that is in a lower power state.
+ */
+ return stat[0].nr_samples >= 1 &&
+ stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+}
+
+static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
+{
+ u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
+
+ if (!issue || !rwb->sync_cookie)
+ return 0;
+
+ now = ktime_to_ns(ktime_get());
+ return now - issue;
+}
+
+enum {
+ LAT_OK,
+ LAT_UNKNOWN,
+ LAT_EXCEEDED,
+};
+
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+{
+ u64 thislat;
+
+ if (!stat_sample_valid(stat))
+ return LAT_UNKNOWN;
+
+ /*
+ * If the 'min' latency exceeds our target, step down.
+ */
+ if (stat[0].min > rwb->min_lat_nsec) {
+ trace_block_wb_lat(stat[0].min);
+ trace_block_wb_stat(stat);
+ return LAT_EXCEEDED;
+ }
+
+ /*
+ * If our stored sync issue exceeds the window size, or it
+ * exceeds our min target AND we haven't logged any entries,
+ * flag the latency as exceeded.
+ */
+ thislat = rwb_sync_issue_lat(rwb);
+ if (thislat > rwb->win_nsec ||
+ (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
+ trace_block_wb_lat(thislat);
+ return LAT_EXCEEDED;
+ }
+
+ if (rwb->scale_step)
+ trace_block_wb_stat(stat);
+
+ return LAT_OK;
+}
+
+static int latency_exceeded(struct rq_wb *rwb)
+{
+ struct blk_rq_stat stat[2];
+
+ blk_queue_stat_get(rwb->q, stat);
+
+ return __latency_exceeded(rwb, stat);
+}
+
+static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
+{
+ trace_block_wb_step(msg, rwb->scale_step, rwb->wb_background,
+ rwb->wb_normal, rwb->wb_max);
+}
+
+static void scale_up(struct rq_wb *rwb)
+{
+ /*
+ * If we're at 0, we can't go lower.
+ */
+ if (!rwb->scale_step)
+ return;
+
+ rwb->scale_step--;
+ calc_wb_limits(rwb);
+
+ if (waitqueue_active(&rwb->wait))
+ wake_up_all(&rwb->wait);
+
+ rwb_trace_step(rwb, "step up");
+}
+
+static void scale_down(struct rq_wb *rwb)
+{
+ /*
+ * Stop scaling down when we've hit the limit. This also prevents
+ * ->scale_step from going to crazy values, if the device can't
+ * keep up.
+ */
+ if (rwb->wb_max == 1)
+ return;
+
+ rwb->scale_step++;
+ blk_stat_clear(rwb->q);
+ calc_wb_limits(rwb);
+ rwb_trace_step(rwb, "step down");
+}
+
+static void rwb_arm_timer(struct rq_wb *rwb)
+{
+ unsigned long expires;
+
+ rwb->win_nsec = 1000000000ULL / int_sqrt((rwb->scale_step + 1) * 100);
+ expires = jiffies + nsecs_to_jiffies(rwb->win_nsec);
+ mod_timer(&rwb->window_timer, expires);
+}
+
+static void blk_wb_timer_fn(unsigned long data)
+{
+ struct rq_wb *rwb = (struct rq_wb *) data;
+ int status;
+
+ /*
+ * If we exceeded the latency target, step down. If we did not,
+ * step one level up. If we don't know enough to say either exceeded
+ * or ok, then don't do anything.
+ */
+ status = latency_exceeded(rwb);
+ switch (status) {
+ case LAT_EXCEEDED:
+ scale_down(rwb);
+ break;
+ case LAT_OK:
+ scale_up(rwb);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * Re-arm timer, if we have IO in flight
+ */
+ if (rwb->scale_step || atomic_read(&rwb->inflight))
+ rwb_arm_timer(rwb);
+}
+
+void blk_wb_update_limits(struct rq_wb *rwb)
+{
+ rwb->scale_step = 0;
+ calc_wb_limits(rwb);
+
+ if (waitqueue_active(&rwb->wait))
+ wake_up_all(&rwb->wait);
+}
+
+static bool close_io(struct rq_wb *rwb)
+{
+ const unsigned long now = jiffies;
+
+ return time_before(now, rwb->last_issue + HZ / 10) ||
+ time_before(now, rwb->last_comp + HZ / 10);
+}
+
+#define REQ_HIPRIO (REQ_SYNC | REQ_META | REQ_PRIO)
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
+{
+ unsigned int limit;
+
+ /*
+ * At this point we know it's a buffered write. If REQ_SYNC is
+ * set, then it's WB_SYNC_ALL writeback, and we'll use the max
+ * limit for that. If the write is marked as a background write,
+ * then use the idle limit, or go to normal if we haven't had
+ * competing IO for a bit.
+ */
+ if ((rw & REQ_HIPRIO) || atomic_read(rwb->bdp_wait))
+ limit = rwb->wb_max;
+ else if ((rw & REQ_BG) || close_io(rwb)) {
+ /*
+ * If less than 100ms since we completed unrelated IO,
+ * limit us to half the depth for background writeback.
+ */
+ limit = rwb->wb_background;
+ } else
+ limit = rwb->wb_normal;
+
+ return limit;
+}
+
+static inline bool may_queue(struct rq_wb *rwb, unsigned long rw)
+{
+ /*
+ * inc it here even if disabled, since we'll dec it at completion.
+ * this only happens if the task was sleeping in __blk_wb_wait(),
+ * and someone turned it off at the same time.
+ */
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rwb->inflight);
+ return true;
+ }
+
+ return atomic_inc_below(&rwb->inflight, get_limit(rwb, rw));
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __blk_wb_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+{
+ DEFINE_WAIT(wait);
+
+ if (may_queue(rwb, rw))
+ return;
+
+ do {
+ prepare_to_wait_exclusive(&rwb->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ if (may_queue(rwb, rw))
+ break;
+
+ if (lock)
+ spin_unlock_irq(lock);
+
+ io_schedule();
+
+ if (lock)
+ spin_lock_irq(lock);
+ } while (1);
+
+ finish_wait(&rwb->wait, &wait);
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool blk_wb_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+{
+ /*
+ * If disabled, or not a WRITE (or a discard), do nothing
+ */
+ if (!rwb_enabled(rwb) || !(bio->bi_rw & REQ_WRITE) ||
+ (bio->bi_rw & REQ_DISCARD))
+ goto no_q;
+
+ /*
+ * Don't throttle WRITE_ODIRECT
+ */
+ if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+ goto no_q;
+
+ __blk_wb_wait(rwb, bio->bi_rw, lock);
+
+ if (!timer_pending(&rwb->window_timer))
+ rwb_arm_timer(rwb);
+
+ return true;
+
+no_q:
+ wb_timestamp(rwb, &rwb->last_issue);
+ return false;
+}
+
+void blk_wb_issue(struct rq_wb *rwb, struct request *rq)
+{
+ if (!rwb_enabled(rwb))
+ return;
+ if (!(rq->cmd_flags & REQ_BUF_INFLIGHT) && !rwb->sync_issue) {
+ rwb->sync_cookie = rq;
+ rwb->sync_issue = rq->issue_time;
+ }
+}
+
+void blk_wb_requeue(struct rq_wb *rwb, struct request *rq)
+{
+ if (!rwb_enabled(rwb))
+ return;
+ if (rq == rwb->sync_cookie) {
+ rwb->sync_issue = 0;
+ rwb->sync_cookie = NULL;
+ }
+}
+
+void blk_wb_init(struct request_queue *q)
+{
+ struct rq_wb *rwb;
+
+ /*
+ * If this fails, we don't get throttling
+ */
+ rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+ if (!rwb)
+ return;
+
+ atomic_set(&rwb->inflight, 0);
+ init_waitqueue_head(&rwb->wait);
+ setup_timer(&rwb->window_timer, blk_wb_timer_fn, (unsigned long) rwb);
+ rwb->last_comp = rwb->last_issue = jiffies;
+ rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping;
+ rwb->q = q;
+
+ if (blk_queue_nonrot(q))
+ rwb->min_lat_nsec = RWB_NONROT_LAT;
+ else
+ rwb->min_lat_nsec = RWB_ROT_LAT;
+
+ blk_wb_update_limits(rwb);
+ q->rq_wb = rwb;
+}
+
+void blk_wb_exit(struct request_queue *q)
+{
+ struct rq_wb *rwb = q->rq_wb;
+
+ if (rwb) {
+ del_timer_sync(&rwb->window_timer);
+ kfree(q->rq_wb);
+ q->rq_wb = NULL;
+ }
+}
diff --git a/block/blk-wb.h b/block/blk-wb.h
new file mode 100644
index 000000000000..6ad47195bc87
--- /dev/null
+++ b/block/blk-wb.h
@@ -0,0 +1,42 @@
+#ifndef BLK_WB_H
+#define BLK_WB_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+
+struct rq_wb {
+ /*
+ * Settings that govern how we throttle
+ */
+ unsigned int wb_background; /* background writeback */
+ unsigned int wb_normal; /* normal writeback */
+ unsigned int wb_max; /* max throughput writeback */
+ unsigned int scale_step;
+
+ u64 win_nsec;
+
+ struct timer_list window_timer;
+
+ s64 sync_issue;
+ void *sync_cookie;
+
+ unsigned long last_issue; /* last non-throttled issue */
+ unsigned long last_comp; /* last non-throttled comp */
+ unsigned long min_lat_nsec;
+ atomic_t *bdp_wait;
+ struct request_queue *q;
+ atomic_t inflight;
+ wait_queue_head_t wait;
+};
+
+void __blk_wb_done(struct rq_wb *);
+void blk_wb_done(struct rq_wb *, struct request *);
+bool blk_wb_wait(struct rq_wb *, struct bio *, spinlock_t *);
+void blk_wb_init(struct request_queue *);
+void blk_wb_exit(struct request_queue *);
+void blk_wb_update_limits(struct rq_wb *);
+void blk_wb_requeue(struct rq_wb *, struct request *);
+void blk_wb_issue(struct rq_wb *, struct request *);
+
+#endif
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2b4414fb4d8e..c41f8a303804 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -189,6 +189,7 @@ enum rq_flag_bits {
__REQ_PM, /* runtime pm request */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
+ __REQ_BUF_INFLIGHT, /* track inflight for buffered */
__REQ_NR_BITS, /* stops here */
};
@@ -243,6 +244,7 @@ enum rq_flag_bits {
#define REQ_PM (1ULL << __REQ_PM)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT)
typedef unsigned int blk_qc_t;
#define BLK_QC_T_NONE -1U
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87f6703ced71..230c55dc95ae 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
struct blkcg_gq;
struct blk_flush_queue;
struct pr_ops;
+struct rq_wb;
#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */
@@ -291,6 +292,8 @@ struct request_queue {
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
+ struct rq_wb *rq_wb;
+
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index e8a5eca1dbe5..8ae9f47d5287 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -667,6 +667,104 @@ TRACE_EVENT(block_rq_remap,
(unsigned long long)__entry->old_sector, __entry->nr_bios)
);
+/**
+ * block_wb_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(block_wb_stat,
+
+ TP_PROTO(struct blk_rq_stat *stat),
+
+ TP_ARGS(stat),
+
+ TP_STRUCT__entry(
+ __field( s64, rmean )
+ __field( u64, rmin )
+ __field( u64, rmax )
+ __field( s64, rnr_samples )
+ __field( s64, rtime )
+ __field( s64, wmean )
+ __field( u64, wmin )
+ __field( u64, wmax )
+ __field( s64, wnr_samples )
+ __field( s64, wtime )
+ ),
+
+ TP_fast_assign(
+ __entry->rmean = stat[0].mean;
+ __entry->rmin = stat[0].min;
+ __entry->rmax = stat[0].max;
+ __entry->rnr_samples = stat[0].nr_samples;
+ __entry->wmean = stat[1].mean;
+ __entry->wmin = stat[1].min;
+ __entry->wmax = stat[1].max;
+ __entry->wnr_samples = stat[1].nr_samples;
+ ),
+
+ TP_printk("read lat: mean=%llu, min=%llu, max=%llu, samples=%llu,"
+ "write lat: mean=%llu, min=%llu, max=%llu, samples=%llu\n",
+ __entry->rmean, __entry->rmin, __entry->rmax,
+ __entry->rnr_samples, __entry->wmean, __entry->wmin,
+ __entry->wmax, __entry->wnr_samples)
+);
+
+/**
+ * block_wb_lat - trace latency event
+ * @lat: latency trigger
+ */
+TRACE_EVENT(block_wb_lat,
+
+ TP_PROTO(unsigned long lat),
+
+ TP_ARGS(lat),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, lat )
+ ),
+
+ TP_fast_assign(
+ __entry->lat = lat;
+ ),
+
+ TP_printk("Latency %llu\n", (unsigned long long) __entry->lat)
+);
+
+/**
+ * block_wb_step - trace wb event step
+ * @msg: context message
+ * @step: the current scale step count
+ * @bg: the current background queue limit
+ * @normal: the current normal writeback limit
+ * @max: the current max throughput writeback limit
+ */
+TRACE_EVENT(block_wb_step,
+
+ TP_PROTO(const char *msg, unsigned int step, unsigned int bg,
+ unsigned int normal, unsigned int max),
+
+ TP_ARGS(msg, step, bg, normal, max),
+
+ TP_STRUCT__entry(
+ __field( const char *, msg )
+ __field( unsigned int, step )
+ __field( unsigned int, bg )
+ __field( unsigned int, normal )
+ __field( unsigned int, max )
+ ),
+
+ TP_fast_assign(
+ __entry->msg = msg;
+ __entry->step = step;
+ __entry->bg = bg;
+ __entry->normal = normal;
+ __entry->max = max;
+ ),
+
+ TP_printk("%s: step=%u, background=%u, normal=%u, max=%u\n",
+ __entry->msg, __entry->step, __entry->bg, __entry->normal,
+ __entry->max)
+);
+
#endif /* _TRACE_BLOCK_H */
/* This part must be outside protection */
--
2.8.0.rc4.6.g7e4ba36
If we're doing background type writes, then use the appropriate
write command for that.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/writeback.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index aa66fa05ff0d..6e4a35acaa3e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,8 @@ static inline int wbc_to_write_cmd(struct writeback_control *wbc)
{
if (wbc->sync_mode == WB_SYNC_ALL)
return WRITE_SYNC;
+ else if (wbc->for_kupdate || wbc->for_background)
+ return WRITE_BG;
return WRITE;
}
--
2.8.0.rc4.6.g7e4ba36
For legacy block, we simply track the request completion stats in the
request queue. For blk-mq, we track them on a per-sw queue basis, which
we can then sum up through the hardware queues and finally to a
per-device state.
The stats are tracked in, roughly, 0.1s interval windows.
Add sysfs files to display the stats.
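For reference, a condensed sketch of how a completion sample gets folded
into the current window (boiled down from blk_stat_add() below; the window
is really 2^27 nsec, i.e. ~134 msec, which is where the "roughly 0.1s"
above comes from):
/*
 * Condensed sketch, not the actual patch code. Samples are bucketed into
 * 2^27 nsec (~134 msec) windows, and the mean is updated incrementally
 * as mean += (value - mean) / (nr_samples + 1).
 */
static void stat_add_sketch(struct blk_rq_stat *stat, s64 issue_time)
{
	s64 now = ktime_to_ns(ktime_get());
	s64 value = now - issue_time;

	/* crossed into a new window? reset min/max/mean/samples */
	if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
		__blk_stat_init(stat, now);

	stat->min = min(stat->min, (u64) value);
	stat->max = max(stat->max, (u64) value);
	stat->mean += div64_s64(value - stat->mean, stat->nr_samples + 1);
	stat->nr_samples++;
}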
Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 4 +
block/blk-mq-sysfs.c | 47 ++++++++++++
block/blk-mq.c | 14 ++++
block/blk-mq.h | 3 +
block/blk-stat.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++
block/blk-stat.h | 17 +++++
block/blk-sysfs.c | 26 +++++++
include/linux/blk_types.h | 8 ++
include/linux/blkdev.h | 4 +
10 files changed, 308 insertions(+), 1 deletion(-)
create mode 100644 block/blk-stat.c
create mode 100644 block/blk-stat.h
diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..3446e0472df0 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- blk-lib.o blk-mq.o blk-mq-tag.o \
+ blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 74c16fd8995d..40b57bf4852c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2514,6 +2514,8 @@ void blk_start_request(struct request *req)
{
blk_dequeue_request(req);
+ req->issue_time = ktime_to_ns(ktime_get());
+
/*
* We are now handing the request to the hardware, initialize
* resid_len to full count and add the timeout handler.
@@ -2581,6 +2583,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
trace_block_rq_complete(req->q, req, nr_bytes);
+ blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
if (!req->bio)
return false;
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4ea4dd8a1eed..2f68015f8616 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
}
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+ struct blk_mq_ctx *ctx;
+ unsigned int i;
+
+ hctx_for_each_ctx(hctx, ctx, i) {
+ blk_stat_init(&ctx->stat[0]);
+ blk_stat_init(&ctx->stat[1]);
+ }
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+ const char *page, size_t count)
+{
+ blk_mq_stat_clear(hctx);
+ return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+ return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+ pre, (long long) stat->nr_samples,
+ (long long) stat->mean, (long long) stat->min,
+ (long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+ struct blk_rq_stat stat[2];
+ ssize_t ret;
+
+ blk_stat_init(&stat[0]);
+ blk_stat_init(&stat[1]);
+
+ blk_hctx_stat_get(hctx, stat);
+
+ ret = print_stat(page, &stat[0], "read :");
+ ret += print_stat(page + ret, &stat[1], "write:");
+ return ret;
+}
+
static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
@@ -304,6 +345,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.attr = {.name = "io_poll", .mode = S_IRUGO },
.show = blk_mq_hw_sysfs_poll_show,
};
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+ .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_mq_hw_sysfs_stat_show,
+ .store = blk_mq_hw_sysfs_stat_store,
+};
static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
@@ -314,6 +360,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
+ &blk_mq_hw_sysfs_stat.attr,
NULL,
};
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1699baf39b78..71b4a13fbf94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-stat.h"
static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -356,10 +357,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
put_cpu();
}
+static void blk_mq_stat_add(struct request *rq)
+{
+ struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
+
+ blk_stat_add(stat, rq);
+}
+
static void __blk_mq_complete_request(struct request *rq)
{
struct request_queue *q = rq->q;
+ blk_mq_stat_add(rq);
+
if (!q->softirq_done_fn)
blk_mq_end_request(rq, rq->errors);
else
@@ -403,6 +413,8 @@ void blk_mq_start_request(struct request *rq)
if (unlikely(blk_bidi_rq(rq)))
rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
+ rq->issue_time = ktime_to_ns(ktime_get());
+
blk_add_timer(rq);
/*
@@ -1761,6 +1773,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
spin_lock_init(&__ctx->lock);
INIT_LIST_HEAD(&__ctx->rq_list);
__ctx->queue = q;
+ blk_stat_init(&__ctx->stat[0]);
+ blk_stat_init(&__ctx->stat[1]);
/* If the cpu isn't online, the cpu is mapped to first hctx */
if (!cpu_online(i))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11037b7..e107f700ff17 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -1,6 +1,8 @@
#ifndef INT_BLK_MQ_H
#define INT_BLK_MQ_H
+#include "blk-stat.h"
+
struct blk_mq_tag_set;
struct blk_mq_ctx {
@@ -20,6 +22,7 @@ struct blk_mq_ctx {
/* incremented at completion time */
unsigned long ____cacheline_aligned_in_smp rq_completed[2];
+ struct blk_rq_stat stat[2];
struct request_queue *queue;
struct kobject kobj;
diff --git a/block/blk-stat.c b/block/blk-stat.c
new file mode 100644
index 000000000000..b38776a83173
--- /dev/null
+++ b/block/blk-stat.c
@@ -0,0 +1,184 @@
+/*
+ * Block stat tracking code
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/blk-mq.h>
+
+#include "blk-stat.h"
+#include "blk-mq.h"
+
+void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+{
+ if (!src->nr_samples)
+ return;
+
+ dst->min = min(dst->min, src->min);
+ dst->max = max(dst->max, src->max);
+
+ if (!dst->nr_samples)
+ dst->mean = src->mean;
+ else {
+ dst->mean = div64_s64((src->mean * src->nr_samples) +
+ (dst->mean * dst->nr_samples),
+ dst->nr_samples + src->nr_samples);
+ }
+ dst->nr_samples += src->nr_samples;
+}
+
+static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+ struct blk_mq_hw_ctx *hctx;
+ struct blk_mq_ctx *ctx;
+ int i, j, nr;
+
+ blk_stat_init(&dst[0]);
+ blk_stat_init(&dst[1]);
+
+ nr = 0;
+ do {
+ uint64_t newest = 0;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ hctx_for_each_ctx(hctx, ctx, j) {
+ if (!ctx->stat[0].nr_samples &&
+ !ctx->stat[1].nr_samples)
+ continue;
+ if (ctx->stat[0].time > newest)
+ newest = ctx->stat[0].time;
+ if (ctx->stat[1].time > newest)
+ newest = ctx->stat[1].time;
+ }
+ }
+
+ /*
+ * No samples
+ */
+ if (!newest)
+ break;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ hctx_for_each_ctx(hctx, ctx, j) {
+ if (ctx->stat[0].time == newest) {
+ blk_stat_sum(&dst[0], &ctx->stat[0]);
+ nr++;
+ }
+ if (ctx->stat[1].time == newest) {
+ blk_stat_sum(&dst[1], &ctx->stat[1]);
+ nr++;
+ }
+ }
+ }
+ /*
+ * If we race on finding an entry, just loop back again.
+ * Should be very rare.
+ */
+ } while (!nr);
+}
+
+void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+ if (q->mq_ops)
+ blk_mq_stat_get(q, dst);
+ else {
+ memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
+ memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
+ }
+}
+
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
+{
+ struct blk_mq_ctx *ctx;
+ unsigned int i, nr;
+
+ nr = 0;
+ do {
+ uint64_t newest = 0;
+
+ hctx_for_each_ctx(hctx, ctx, i) {
+ if (!ctx->stat[0].nr_samples &&
+ !ctx->stat[1].nr_samples)
+ continue;
+
+ if (ctx->stat[0].time > newest)
+ newest = ctx->stat[0].time;
+ if (ctx->stat[1].time > newest)
+ newest = ctx->stat[1].time;
+ }
+
+ if (!newest)
+ break;
+
+ hctx_for_each_ctx(hctx, ctx, i) {
+ if (ctx->stat[0].time == newest) {
+ blk_stat_sum(&dst[0], &ctx->stat[0]);
+ nr++;
+ }
+ if (ctx->stat[1].time == newest) {
+ blk_stat_sum(&dst[1], &ctx->stat[1]);
+ nr++;
+ }
+ }
+ /*
+ * If we race on finding an entry, just loop back again.
+ * Should be very rare, as the window is only updated
+ * occasionally
+ */
+ } while (!nr);
+}
+
+static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
+{
+ stat->min = -1ULL;
+ stat->max = stat->nr_samples = stat->mean = 0;
+ stat->time = time_now & BLK_STAT_MASK;
+}
+
+void blk_stat_init(struct blk_rq_stat *stat)
+{
+ __blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
+{
+ s64 delta, now, value;
+
+ now = ktime_to_ns(ktime_get());
+ if (now < rq->issue_time)
+ return;
+
+ if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
+ __blk_stat_init(stat, now);
+
+ value = now - rq->issue_time;
+ if (value > stat->max)
+ stat->max = value;
+ if (value < stat->min)
+ stat->min = value;
+
+ delta = value - stat->mean;
+ if (delta)
+ stat->mean += div64_s64(delta, stat->nr_samples + 1);
+
+ stat->nr_samples++;
+}
+
+void blk_stat_clear(struct request_queue *q)
+{
+ if (q->mq_ops) {
+ struct blk_mq_hw_ctx *hctx;
+ struct blk_mq_ctx *ctx;
+ int i, j;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ hctx_for_each_ctx(hctx, ctx, j) {
+ blk_stat_init(&ctx->stat[0]);
+ blk_stat_init(&ctx->stat[1]);
+ }
+ }
+ } else {
+ blk_stat_init(&q->rq_stats[0]);
+ blk_stat_init(&q->rq_stats[1]);
+ }
+}
diff --git a/block/blk-stat.h b/block/blk-stat.h
new file mode 100644
index 000000000000..d77548dbf196
--- /dev/null
+++ b/block/blk-stat.h
@@ -0,0 +1,17 @@
+#ifndef BLK_STAT_H
+#define BLK_STAT_H
+
+/*
+ * ~0.13s window as a power-of-2 (2^27 nsecs)
+ */
+#define BLK_STAT_NSEC 134217728ULL
+#define BLK_STAT_MASK ~(BLK_STAT_NSEC - 1)
+
+void blk_stat_add(struct blk_rq_stat *, struct request *);
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
+void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
+void blk_stat_clear(struct request_queue *q);
+void blk_stat_init(struct blk_rq_stat *);
+void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
+
+#endif
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 99205965f559..6e516cc0d3d0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -379,6 +379,26 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
return count;
}
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+ return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+ pre, (long long) stat->nr_samples,
+ (long long) stat->mean, (long long) stat->min,
+ (long long) stat->max);
+}
+
+static ssize_t queue_stats_show(struct request_queue *q, char *page)
+{
+ struct blk_rq_stat stat[2];
+ ssize_t ret;
+
+ blk_queue_stat_get(q, stat);
+
+ ret = print_stat(page, &stat[0], "read :");
+ ret += print_stat(page + ret, &stat[1], "write:");
+ return ret;
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -516,6 +536,11 @@ static struct queue_sysfs_entry queue_wc_entry = {
.store = queue_wc_store,
};
+static struct queue_sysfs_entry queue_stats_entry = {
+ .attr = {.name = "stats", .mode = S_IRUGO },
+ .show = queue_stats_show,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -542,6 +567,7 @@ static struct attribute *default_attrs[] = {
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
+ &queue_stats_entry.attr,
NULL,
};
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 223012451c7a..2b4414fb4d8e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -268,4 +268,12 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
}
+struct blk_rq_stat {
+ s64 mean;
+ u64 min;
+ u64 max;
+ s64 nr_samples;
+ s64 time;
+};
+
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eee94bd6de52..87f6703ced71 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -153,6 +153,7 @@ struct request {
struct gendisk *rq_disk;
struct hd_struct *part;
unsigned long start_time;
+ s64 issue_time;
#ifdef CONFIG_BLK_CGROUP
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
@@ -402,6 +403,9 @@ struct request_queue {
unsigned int nr_sorted;
unsigned int in_flight[2];
+
+ struct blk_rq_stat rq_stats[2];
+
/*
* Number of active block driver functions for which blk_drain_queue()
* must wait. Must be incremented around functions that unlock the
--
2.8.0.rc4.6.g7e4ba36
For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.
Fill a hole in struct request_queue with a 'queue_depth' member, and
add a blk_set_queue_depth() helper that drivers can call to more
closely inform the block layer of the real queue depth.
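As a usage sketch (hypothetical driver code, not part of this patch; the
scsi hookup below is the only real user added here), a driver just passes
its hardware tag count down at init time, and blk_queue_depth() then
prefers that over ->nr_requests:
/*
 * Hypothetical example; only the scsi conversion below is in this patch.
 */
static void mydrv_setup_queue(struct request_queue *q, unsigned int hw_tags)
{
	/* tell the block layer the real hardware queue depth */
	blk_set_queue_depth(q, hw_tags);

	/* blk_queue_depth(q) now returns hw_tags instead of nr_requests */
}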
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-settings.c | 12 ++++++++++++
drivers/scsi/scsi.c | 3 +++
include/linux/blkdev.h | 11 +++++++++++
3 files changed, 26 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae122843..f7e122e717e8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -832,6 +832,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
/**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q: the request queue for the device
+ * @depth: queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+ q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
+/**
* blk_queue_write_cache - configure queue's write cache
* @q: the request queue for the device
* @wc: write back cache on or off
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6adc411f..75455d4dab68 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
wmb();
}
+ if (sdev->request_queue)
+ blk_set_queue_depth(sdev->request_queue, depth);
+
return sdev->queue_depth;
}
EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fc1894996b12..eee94bd6de52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -315,6 +315,8 @@ struct request_queue {
struct blk_mq_ctx __percpu *queue_ctx;
unsigned int nr_queues;
+ unsigned int queue_depth;
+
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
@@ -681,6 +683,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
return false;
}
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+ if (q->queue_depth)
+ return q->queue_depth;
+
+ return q->nr_requests;
+}
+
/*
* q->prep_rq_fn return values
*/
@@ -984,6 +994,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
extern void blk_set_default_limits(struct queue_limits *lim);
extern void blk_set_stacking_limits(struct queue_limits *lim);
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
--
2.8.0.rc4.6.g7e4ba36
Add wbc_to_write_cmd(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring in other wbc fields when picking the write type.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/block_dev.c | 2 +-
fs/buffer.c | 2 +-
fs/f2fs/data.c | 2 +-
fs/f2fs/node.c | 2 +-
fs/gfs2/meta_io.c | 3 +--
fs/mpage.c | 9 ++++-----
fs/xfs/xfs_aops.c | 2 +-
include/linux/writeback.h | 8 ++++++++
8 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..8662da6aa07c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
struct page *page, struct writeback_control *wbc)
{
int result;
- int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+ int rw = wbc_to_write_cmd(wbc);
const struct block_device_operations *ops = bdev->bd_disk->fops;
if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index af0d9a82a8ed..46763c58e786 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = wbc_to_write_cmd(wbc);
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5dafb9cef12e..e4e81ce663c5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1153,7 +1153,7 @@ static int f2fs_write_data_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = DATA,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write_cmd(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 1a33de9d84b1..3b377258dc09 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = NODE,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write_cmd(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 0448524c11bc..3fdfa3848f18 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
{
struct buffer_head *bh, *head;
int nr_underway = 0;
- int write_op = REQ_META | REQ_PRIO |
- (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = REQ_META | REQ_PRIO | wbc_to_write_cmd(wbc);
BUG_ON(!PageLocked(page));
BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index eedc644b78d7..bcbdb61b24f1 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
int ret = 0;
- int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
* This page will go to BIO. Do we need to send this BIO off first?
*/
if (bio && mpd->last_block_in_bio != blocks[0] - 1)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
alloc_new:
if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
wbc_account_io(wbc, page, PAGE_SIZE);
length = first_unmapped << blkbits;
if (bio_add_page(bio, page, length, 0) < length) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
goto alloc_new;
}
@@ -632,7 +631,7 @@ alloc_new:
set_page_writeback(page);
unlock_page(page);
if (boundary || (first_unmapped != blocks_per_page)) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:
confused:
if (bio)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
if (mpd->use_writepage) {
ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e49b2406d15d..e6c721f4153b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
atomic_inc(&ioend->io_remaining);
bio->bi_private = ioend;
bio->bi_end_io = xfs_end_bio;
- submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+ submit_bio(wbc_to_write_cmd(wbc), bio);
}
STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..aa66fa05ff0d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
#endif
};
+static inline int wbc_to_write_cmd(struct writeback_control *wbc)
+{
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ return WRITE_SYNC;
+
+ return WRITE;
+}
+
/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
--
2.8.0.rc4.6.g7e4ba36
If we end up waiting on a page that is dirty or marked writeback,
then increment the corresponding bdi_writeback counter.
Signed-off-by: Jens Axboe <[email protected]>
---
mm/filemap.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f2479af09da9..a8854a083b71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -764,37 +764,73 @@ wait_queue_head_t *page_waitqueue(struct page *page)
}
EXPORT_SYMBOL(page_waitqueue);
+static bool inc_dirty_wait(struct page *page)
+{
+ if (!page->mapping || !PageDirty(page) || !PageWriteback(page))
+ return false;
+ else {
+ struct bdi_writeback *wb = inode_to_wb(page->mapping->host);
+
+ atomic_inc(&wb->dirty_sleeping);
+ return true;
+ }
+}
+
+static void dec_dirty_wait(struct page *page)
+{
+ struct bdi_writeback *wb = inode_to_wb(page->mapping->host);
+
+ atomic_dec(&wb->dirty_sleeping);
+}
+
void wait_on_page_bit(struct page *page, int bit_nr)
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
- if (test_bit(bit_nr, &page->flags))
+ if (test_bit(bit_nr, &page->flags)) {
+ bool did_inc = inc_dirty_wait(page);
__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
TASK_UNINTERRUPTIBLE);
+ if (did_inc)
+ dec_dirty_wait(page);
+ }
}
EXPORT_SYMBOL(wait_on_page_bit);
int wait_on_page_bit_killable(struct page *page, int bit_nr)
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+ bool did_inc;
+ int ret;
if (!test_bit(bit_nr, &page->flags))
return 0;
- return __wait_on_bit(page_waitqueue(page), &wait,
+ did_inc = inc_dirty_wait(page);
+ ret = __wait_on_bit(page_waitqueue(page), &wait,
bit_wait_io, TASK_KILLABLE);
+ if (did_inc)
+ dec_dirty_wait(page);
+ return ret;
}
int wait_on_page_bit_killable_timeout(struct page *page,
int bit_nr, unsigned long timeout)
{
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+ bool did_inc;
+ int ret;
wait.key.timeout = jiffies + timeout;
if (!test_bit(bit_nr, &page->flags))
return 0;
- return __wait_on_bit(page_waitqueue(page), &wait,
+
+ did_inc = inc_dirty_wait(page);
+ ret = __wait_on_bit(page_waitqueue(page), &wait,
bit_wait_io_timeout, TASK_KILLABLE);
+ if (did_inc)
+ dec_dirty_wait(page);
+ return ret;
}
EXPORT_SYMBOL_GPL(wait_on_page_bit_killable_timeout);
--
2.8.0.rc4.6.g7e4ba36
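As an aside, the same inc_dirty_wait()/dec_dirty_wait() bracket repeats in
all three wait variants in the patch above. A hypothetical wrapper that
centralizes the pattern is sketched below; it is not part of the posted
patch (the timeout variant would additionally need wait.key.timeout set
before calling it), it only makes the inc/dec pairing explicit. The action
and mode arguments would be e.g. bit_wait_io and TASK_UNINTERRUPTIBLE or
TASK_KILLABLE, as in the patch.

static int wait_on_page_bit_accounted(struct page *page, int bit_nr,
				      wait_bit_action_f *action, unsigned mode)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
	bool did_inc;
	int ret;

	if (!test_bit(bit_nr, &page->flags))
		return 0;

	/* Only count sleepers on pages that are dirty and under writeback */
	did_inc = inc_dirty_wait(page);
	ret = __wait_on_bit(page_waitqueue(page), &wait, action, mode);
	if (did_inc)
		dec_dirty_wait(page);
	return ret;
}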
Note in the bdi_writeback structure if a task is currently being
limited in balance_dirty_pages(), waiting for writeback to
proceed.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/backing-dev-defs.h | 2 ++
mm/backing-dev.c | 1 +
mm/page-writeback.c | 2 ++
3 files changed, 5 insertions(+)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 3f103076d0bf..1212c374b928 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */
+ atomic_t dirty_sleeping; /* waiting on dirty limit exceeded */
+
struct list_head bdi_node; /* anchored at bdi->wb_list */
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0c6317b7db38..41db7dff11d0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
spin_lock_init(&wb->work_lock);
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+ atomic_set(&wb->dirty_sleeping, 0);
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
if (!wb->congested)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d35ccc..028a3d4d7129 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
pause,
start_time);
__set_current_state(TASK_KILLABLE);
+ atomic_inc(&wb->dirty_sleeping);
io_schedule_timeout(pause);
+ atomic_dec(&wb->dirty_sleeping);
current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
--
2.8.0.rc4.6.g7e4ba36
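The counter is write-only in this patch; the intended consumer is the
block-layer throttling code, which can treat "someone is sleeping in
balance_dirty_pages()" as a signal that writeback is urgent and should not
be scaled back. This is presumably what the bdp_wait field in the wb_stats
sysfs output later in the thread reflects. A minimal sketch of such a check
is below; the helper name and the policy are illustrative, not code from
the series, and it assumes the root wb of the bdi.

static bool bdi_has_dirty_waiters(struct backing_dev_info *bdi)
{
	/*
	 * Non-zero means at least one task is blocked in
	 * balance_dirty_pages() waiting for writeback to make progress,
	 * so background writeback should not be throttled down further.
	 */
	return atomic_read(&bdi->wb.dirty_sleeping) != 0;
}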
This adds a new request flag, REQ_BG, that callers can use to tell
the block layer that this is background (non-urgent) IO.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/blk_types.h | 4 +++-
include/linux/fs.h | 4 ++++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..223012451c7a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -161,6 +161,7 @@ enum rq_flag_bits {
__REQ_INTEGRITY, /* I/O includes block integrity payload */
__REQ_FUA, /* forced unit access */
__REQ_FLUSH, /* request for cache flush */
+ __REQ_BG, /* background activity */
/* bio only flags */
__REQ_RAHEAD, /* read ahead, can fail anytime */
@@ -208,7 +209,7 @@ enum rq_flag_bits {
#define REQ_COMMON_MASK \
(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
- REQ_SECURE | REQ_INTEGRITY)
+ REQ_SECURE | REQ_INTEGRITY | REQ_BG)
#define REQ_CLONE_MASK REQ_COMMON_MASK
#define BIO_NO_ADVANCE_ITER_MASK (REQ_DISCARD|REQ_WRITE_SAME)
@@ -235,6 +236,7 @@ enum rq_flag_bits {
#define REQ_COPY_USER (1ULL << __REQ_COPY_USER)
#define REQ_FLUSH (1ULL << __REQ_FLUSH)
#define REQ_FLUSH_SEQ (1ULL << __REQ_FLUSH_SEQ)
+#define REQ_BG (1ULL << __REQ_BG)
#define REQ_IO_STAT (1ULL << __REQ_IO_STAT)
#define REQ_MIXED_MERGE (1ULL << __REQ_MIXED_MERGE)
#define REQ_SECURE (1ULL << __REQ_SECURE)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b58baaf..bb8f951cc619 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,9 @@ typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
* WRITE_FLUSH_FUA Combination of WRITE_FLUSH and FUA. The IO is preceded
* by a cache flush and data is guaranteed to be on
* non-volatile media on completion.
+ * WRITE_BG Background write. This is for background activity like
+ * the periodic flush and background threshold writeback
+ *
*
*/
#define RW_MASK REQ_WRITE
@@ -207,6 +210,7 @@ typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH)
#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FUA)
#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
+#define WRITE_BG (WRITE | REQ_NOIDLE | REQ_BG)
/*
* Attribute flags. These should be or-ed together to figure out what
--
2.8.0.rc4.6.g7e4ba36
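WRITE_BG only has an effect once the writeback path actually tags its IO
with it. One plausible way to do that is to extend the wbc_to_write_cmd()
helper from the first patch so that background and kupdate-style writeback
map to WRITE_BG, using the existing for_kupdate/for_background fields of
struct writeback_control. The sketch below is an assumption about how the
series wires this up, not a quoted patch.

static inline int wbc_to_write_cmd(struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL)
		return WRITE_SYNC;
	if (wbc->for_kupdate || wbc->for_background)
		return WRITE_BG;

	return WRITE;
}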
On Sun 17-04-16 23:24:41, Jens Axboe wrote:
> Add wbc_to_write_cmd(), which returns the write type to use, based on a
> struct writeback_control. No functional changes in this patch, but it
> prepares us for factoring other wbc fields for write type.
>
> Signed-off-by: Jens Axboe <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/block_dev.c | 2 +-
> fs/buffer.c | 2 +-
> fs/f2fs/data.c | 2 +-
> fs/f2fs/node.c | 2 +-
> fs/gfs2/meta_io.c | 3 +--
> fs/mpage.c | 9 ++++-----
> fs/xfs/xfs_aops.c | 2 +-
> include/linux/writeback.h | 8 ++++++++
> 8 files changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 20a2c02b77c4..8662da6aa07c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
> struct page *page, struct writeback_control *wbc)
> {
> int result;
> - int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
> + int rw = wbc_to_write_cmd(wbc);
> const struct block_device_operations *ops = bdev->bd_disk->fops;
>
> if (!ops->rw_page || bdev_get_integrity(bdev))
> diff --git a/fs/buffer.c b/fs/buffer.c
> index af0d9a82a8ed..46763c58e786 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> unsigned int blocksize, bbits;
> int nr_underway = 0;
> - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> + int write_op = wbc_to_write_cmd(wbc);
>
> head = create_page_buffers(page, inode,
> (1 << BH_Dirty)|(1 << BH_Uptodate));
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 5dafb9cef12e..e4e81ce663c5 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -1153,7 +1153,7 @@ static int f2fs_write_data_page(struct page *page,
> struct f2fs_io_info fio = {
> .sbi = sbi,
> .type = DATA,
> - .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
> + .rw = wbc_to_write_cmd(wbc),
> .page = page,
> .encrypted_page = NULL,
> };
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 1a33de9d84b1..3b377258dc09 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
> struct f2fs_io_info fio = {
> .sbi = sbi,
> .type = NODE,
> - .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
> + .rw = wbc_to_write_cmd(wbc),
> .page = page,
> .encrypted_page = NULL,
> };
> diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
> index 0448524c11bc..3fdfa3848f18 100644
> --- a/fs/gfs2/meta_io.c
> +++ b/fs/gfs2/meta_io.c
> @@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
> {
> struct buffer_head *bh, *head;
> int nr_underway = 0;
> - int write_op = REQ_META | REQ_PRIO |
> - (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> + int write_op = REQ_META | REQ_PRIO | wbc_to_write_cmd(wbc);
>
> BUG_ON(!PageLocked(page));
> BUG_ON(!page_has_buffers(page));
> diff --git a/fs/mpage.c b/fs/mpage.c
> index eedc644b78d7..bcbdb61b24f1 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
> struct buffer_head map_bh;
> loff_t i_size = i_size_read(inode);
> int ret = 0;
> - int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
>
> if (page_has_buffers(page)) {
> struct buffer_head *head = page_buffers(page);
> @@ -595,7 +594,7 @@ page_is_mapped:
> * This page will go to BIO. Do we need to send this BIO off first?
> */
> if (bio && mpd->last_block_in_bio != blocks[0] - 1)
> - bio = mpage_bio_submit(wr, bio);
> + bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
>
> alloc_new:
> if (bio == NULL) {
> @@ -622,7 +621,7 @@ alloc_new:
> wbc_account_io(wbc, page, PAGE_SIZE);
> length = first_unmapped << blkbits;
> if (bio_add_page(bio, page, length, 0) < length) {
> - bio = mpage_bio_submit(wr, bio);
> + bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
> goto alloc_new;
> }
>
> @@ -632,7 +631,7 @@ alloc_new:
> set_page_writeback(page);
> unlock_page(page);
> if (boundary || (first_unmapped != blocks_per_page)) {
> - bio = mpage_bio_submit(wr, bio);
> + bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
> if (boundary_block) {
> write_boundary_block(boundary_bdev,
> boundary_block, 1 << blkbits);
> @@ -644,7 +643,7 @@ alloc_new:
>
> confused:
> if (bio)
> - bio = mpage_bio_submit(wr, bio);
> + bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
>
> if (mpd->use_writepage) {
> ret = mapping->a_ops->writepage(page, wbc);
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e49b2406d15d..e6c721f4153b 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
> atomic_inc(&ioend->io_remaining);
> bio->bi_private = ioend;
> bio->bi_end_io = xfs_end_bio;
> - submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
> + submit_bio(wbc_to_write_cmd(wbc), bio);
> }
>
> STATIC struct bio *
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d0b5ca5d4e08..aa66fa05ff0d 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -100,6 +100,14 @@ struct writeback_control {
> #endif
> };
>
> +static inline int wbc_to_write_cmd(struct writeback_control *wbc)
> +{
> + if (wbc->sync_mode == WB_SYNC_ALL)
> + return WRITE_SYNC;
> +
> + return WRITE;
> +}
> +
> /*
> * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
> * and are measured against each other in. There always is one global
> --
> 2.8.0.rc4.6.g7e4ba36
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 04/18/2016 11:12 AM, Jan Kara wrote:
> On Sun 17-04-16 23:24:41, Jens Axboe wrote:
>> Add wbc_to_write_cmd(), which returns the write type to use, based on a
>> struct writeback_control. No functional changes in this patch, but it
>> prepares us for factoring other wbc fields for write type.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>
> Looks good. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
Thanks Jan, added!
--
Jens Axboe
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 40b57bf4852c..d941f69dfb4b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -39,6 +39,7 @@
>
> #include "blk.h"
> #include "blk-mq.h"
> +#include "blk-wb.h"
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
> EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>
> fail:
> blk_free_flush_queue(q->fq);
> + blk_wb_exit(q);
> return NULL;
> }
> EXPORT_SYMBOL(blk_init_allocated_queue);
> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
> blk_delete_timer(rq);
> blk_clear_rq_complete(rq);
> trace_block_rq_requeue(q, rq);
> + blk_wb_requeue(q->rq_wb, rq);
>
> if (rq->cmd_flags & REQ_QUEUED)
> blk_queue_end_tag(q, rq);
> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> /* this is a bio leak */
> WARN_ON(req->bio != NULL);
>
> + blk_wb_done(q->rq_wb, req);
> +
> /*
> * Request may not have originated from ll_rw_blk. if not,
> * it didn't come out of our reserved rq pools
> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
> struct request *req;
> unsigned int request_count = 0;
> + bool wb_acct;
>
> /*
> * low level driver can indicate that it wants pages above a
> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> }
>
> get_rq:
> + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
> +
> /*
> * This sync check and mask will be re-done in init_request_from_bio(),
> * but we need to set it earlier to expose the sync flag to the
> @@ -1781,11 +1789,16 @@ get_rq:
> */
> req = get_request(q, rw_flags, bio, GFP_NOIO);
> if (IS_ERR(req)) {
> + if (wb_acct)
> + __blk_wb_done(q->rq_wb);
> bio->bi_error = PTR_ERR(req);
> bio_endio(bio);
> goto out_unlock;
> }
>
> + if (wb_acct)
> + req->cmd_flags |= REQ_BUF_INFLIGHT;
> +
> /*
> * After dropping the lock and possibly sleeping here, our request
> * may now be mergeable after it had proven unmergeable (above).
> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
> blk_dequeue_request(req);
>
> req->issue_time = ktime_to_ns(ktime_get());
> + blk_wb_issue(req->q->rq_wb, req);
>
> /*
> * We are now handing the request to the hardware, initialize
> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
> blk_unprep_request(req);
>
> blk_account_io_done(req);
> + blk_wb_done(req->q->rq_wb, req);
Hi Jens,
It seems the function blk_wb_done() will be executed twice even if the end_io
callback is set.
Maybe the same thing would happen in blk-mq.c.
>
> if (req->end_io)
> req->end_io(req, error);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 71b4a13fbf94..c0c5207fe7fd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -30,6 +30,7 @@
> #include "blk-mq.h"
> #include "blk-mq-tag.h"
> #include "blk-stat.h"
> +#include "blk-wb.h"
>
> static DEFINE_MUTEX(all_q_mutex);
> static LIST_HEAD(all_q_list);
> @@ -275,6 +276,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
>
> if (rq->cmd_flags & REQ_MQ_INFLIGHT)
> atomic_dec(&hctx->nr_active);
> +
> + blk_wb_done(q->rq_wb, rq);
> +
> rq->cmd_flags = 0;
>
> clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> @@ -305,6 +309,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request);
> inline void __blk_mq_end_request(struct request *rq, int error)
> {
> blk_account_io_done(rq);
> + blk_wb_done(rq->q->rq_wb, rq);
>
> if (rq->end_io) {
> rq->end_io(rq, error);
> @@ -414,6 +419,7 @@ void blk_mq_start_request(struct request *rq)
> rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
>
> rq->issue_time = ktime_to_ns(ktime_get());
> + blk_wb_issue(q->rq_wb, rq);
>
> blk_add_timer(rq);
>
> @@ -450,6 +456,7 @@ static void __blk_mq_requeue_request(struct request *rq)
> struct request_queue *q = rq->q;
>
> trace_block_rq_requeue(q, rq);
> + blk_wb_requeue(q->rq_wb, rq);
>
> if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> if (q->dma_drain_size && blk_rq_bytes(rq))
> @@ -1265,6 +1272,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> struct blk_plug *plug;
> struct request *same_queue_rq = NULL;
> blk_qc_t cookie;
> + bool wb_acct;
>
> blk_queue_bounce(q, &bio);
>
> @@ -1282,9 +1290,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> } else
> request_count = blk_plug_queued_count(q);
>
> + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
> +
> rq = blk_mq_map_request(q, bio, &data);
> - if (unlikely(!rq))
> + if (unlikely(!rq)) {
> + if (wb_acct)
> + __blk_wb_done(q->rq_wb);
> return BLK_QC_T_NONE;
> + }
> +
> + if (wb_acct)
> + rq->cmd_flags |= REQ_BUF_INFLIGHT;
>
> cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
>
> @@ -1361,6 +1377,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> struct blk_map_ctx data;
> struct request *rq;
> blk_qc_t cookie;
> + bool wb_acct;
>
> blk_queue_bounce(q, &bio);
>
> @@ -1375,9 +1392,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> blk_attempt_plug_merge(q, bio, &request_count, NULL))
> return BLK_QC_T_NONE;
>
> + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
> +
> rq = blk_mq_map_request(q, bio, &data);
> - if (unlikely(!rq))
> + if (unlikely(!rq)) {
> + if (wb_acct)
> + __blk_wb_done(q->rq_wb);
> return BLK_QC_T_NONE;
> + }
> +
> + if (wb_acct)
> + rq->cmd_flags |= REQ_BUF_INFLIGHT;
>
> cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
>
> @@ -2111,6 +2136,8 @@ void blk_mq_free_queue(struct request_queue *q)
> list_del_init(&q->all_q_node);
--
Regards
Kaixu Xia
On 04/23/2016 02:21 AM, xiakaixu wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 40b57bf4852c..d941f69dfb4b 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -39,6 +39,7 @@
>>
>> #include "blk.h"
>> #include "blk-mq.h"
>> +#include "blk-wb.h"
>>
>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
>> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>>
>> fail:
>> blk_free_flush_queue(q->fq);
>> + blk_wb_exit(q);
>> return NULL;
>> }
>> EXPORT_SYMBOL(blk_init_allocated_queue);
>> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
>> blk_delete_timer(rq);
>> blk_clear_rq_complete(rq);
>> trace_block_rq_requeue(q, rq);
>> + blk_wb_requeue(q->rq_wb, rq);
>>
>> if (rq->cmd_flags & REQ_QUEUED)
>> blk_queue_end_tag(q, rq);
>> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>> /* this is a bio leak */
>> WARN_ON(req->bio != NULL);
>>
>> + blk_wb_done(q->rq_wb, req);
>> +
>> /*
>> * Request may not have originated from ll_rw_blk. if not,
>> * it didn't come out of our reserved rq pools
>> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>> int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>> struct request *req;
>> unsigned int request_count = 0;
>> + bool wb_acct;
>>
>> /*
>> * low level driver can indicate that it wants pages above a
>> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>> }
>>
>> get_rq:
>> + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
>> +
>> /*
>> * This sync check and mask will be re-done in init_request_from_bio(),
>> * but we need to set it earlier to expose the sync flag to the
>> @@ -1781,11 +1789,16 @@ get_rq:
>> */
>> req = get_request(q, rw_flags, bio, GFP_NOIO);
>> if (IS_ERR(req)) {
>> + if (wb_acct)
>> + __blk_wb_done(q->rq_wb);
>> bio->bi_error = PTR_ERR(req);
>> bio_endio(bio);
>> goto out_unlock;
>> }
>>
>> + if (wb_acct)
>> + req->cmd_flags |= REQ_BUF_INFLIGHT;
>> +
>> /*
>> * After dropping the lock and possibly sleeping here, our request
>> * may now be mergeable after it had proven unmergeable (above).
>> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>> blk_dequeue_request(req);
>>
>> req->issue_time = ktime_to_ns(ktime_get());
>> + blk_wb_issue(req->q->rq_wb, req);
>>
>> /*
>> * We are now handing the request to the hardware, initialize
>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>> blk_unprep_request(req);
>>
>> blk_account_io_done(req);
>> + blk_wb_done(req->q->rq_wb, req);
>
> Hi Jens,
>
> It seems the function blk_wb_done() will be executed twice even if the end_io
> callback is set.
> Maybe the same thing would happen in blk-mq.c.
Yeah, that was a mistake, the current version has it fixed. It was
inadvertently added when I discovered that the flush request didn't work
properly. Now it just duplicates the call inside the check for whether it has
an ->end_io() defined, since we don't use the normal path for that.
--
Jens Axboe
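For reference, the structure being described is roughly the following. This
is only a sketch of the idea, not the committed fix: the accounting call is
made in the ->end_io() branch, because requests with a private completion
handler never reach the normal free path (__blk_mq_free_request()), where
blk_wb_done() already runs.

inline void __blk_mq_end_request(struct request *rq, int error)
{
	blk_account_io_done(rq);

	if (rq->end_io) {
		/* Private completion: won't pass through __blk_mq_free_request() */
		blk_wb_done(rq->q->rq_wb, rq);
		rq->end_io(rq, error);
	} else {
		if (unlikely(blk_bidi_rq(rq)))
			blk_mq_free_request(rq->next_rq);
		blk_mq_free_request(rq);
	}
}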
On 2016/4/24 5:37, Jens Axboe wrote:
> On 04/23/2016 02:21 AM, xiakaixu wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 40b57bf4852c..d941f69dfb4b 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -39,6 +39,7 @@
>>>
>>> #include "blk.h"
>>> #include "blk-mq.h"
>>> +#include "blk-wb.h"
>>>
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
>>> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>>>
>>> fail:
>>> blk_free_flush_queue(q->fq);
>>> + blk_wb_exit(q);
>>> return NULL;
>>> }
>>> EXPORT_SYMBOL(blk_init_allocated_queue);
>>> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
>>> blk_delete_timer(rq);
>>> blk_clear_rq_complete(rq);
>>> trace_block_rq_requeue(q, rq);
>>> + blk_wb_requeue(q->rq_wb, rq);
>>>
>>> if (rq->cmd_flags & REQ_QUEUED)
>>> blk_queue_end_tag(q, rq);
>>> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>>> /* this is a bio leak */
>>> WARN_ON(req->bio != NULL);
>>>
>>> + blk_wb_done(q->rq_wb, req);
>>> +
>>> /*
>>> * Request may not have originated from ll_rw_blk. if not,
>>> * it didn't come out of our reserved rq pools
>>> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>> int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>>> struct request *req;
>>> unsigned int request_count = 0;
>>> + bool wb_acct;
>>>
>>> /*
>>> * low level driver can indicate that it wants pages above a
>>> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>> }
>>>
>>> get_rq:
>>> + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
>>> +
>>> /*
>>> * This sync check and mask will be re-done in init_request_from_bio(),
>>> * but we need to set it earlier to expose the sync flag to the
>>> @@ -1781,11 +1789,16 @@ get_rq:
>>> */
>>> req = get_request(q, rw_flags, bio, GFP_NOIO);
>>> if (IS_ERR(req)) {
>>> + if (wb_acct)
>>> + __blk_wb_done(q->rq_wb);
>>> bio->bi_error = PTR_ERR(req);
>>> bio_endio(bio);
>>> goto out_unlock;
>>> }
>>>
>>> + if (wb_acct)
>>> + req->cmd_flags |= REQ_BUF_INFLIGHT;
>>> +
>>> /*
>>> * After dropping the lock and possibly sleeping here, our request
>>> * may now be mergeable after it had proven unmergeable (above).
>>> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>>> blk_dequeue_request(req);
>>>
>>> req->issue_time = ktime_to_ns(ktime_get());
>>> + blk_wb_issue(req->q->rq_wb, req);
>>>
>>> /*
>>> * We are now handing the request to the hardware, initialize
>>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>>> blk_unprep_request(req);
>>>
>>> blk_account_io_done(req);
>>> + blk_wb_done(req->q->rq_wb, req);
>>
>> Hi Jens,
>>
>> It seems the function blk_wb_done() will be executed twice even if the end_io
>> callback is set.
>> Maybe the same thing would happen in blk-mq.c.
>
> Yeah, that was a mistake, the current version has it fixed. It was inadvertently added when I discovered that the flush request didn't work properly. Now it just duplicates the call inside the check for whether it has an ->end_io() defined, since we don't use the normal path for that.
>
Hi Jens,
I have checked the wb-buf-throttle branch in your block git repo; I am not sure it is the complete version.
It seems the problem is only fixed in blk-mq.c. The function blk_wb_done() would still be executed twice in blk-core.c
(in the functions blk_finish_request() and __blk_put_request()).
Maybe we can add a flag to mark whether blk_wb_done() has already been done.
--
Regards
Kaixu Xia
On 04/25/2016 05:41 AM, xiakaixu wrote:
> 于 2016/4/24 5:37, Jens Axboe 写道:
>> On 04/23/2016 02:21 AM, xiakaixu wrote:
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 40b57bf4852c..d941f69dfb4b 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -39,6 +39,7 @@
>>>>
>>>> #include "blk.h"
>>>> #include "blk-mq.h"
>>>> +#include "blk-wb.h"
>>>>
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
>>>> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>>>>
>>>> fail:
>>>> blk_free_flush_queue(q->fq);
>>>> + blk_wb_exit(q);
>>>> return NULL;
>>>> }
>>>> EXPORT_SYMBOL(blk_init_allocated_queue);
>>>> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
>>>> blk_delete_timer(rq);
>>>> blk_clear_rq_complete(rq);
>>>> trace_block_rq_requeue(q, rq);
>>>> + blk_wb_requeue(q->rq_wb, rq);
>>>>
>>>> if (rq->cmd_flags & REQ_QUEUED)
>>>> blk_queue_end_tag(q, rq);
>>>> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>>>> /* this is a bio leak */
>>>> WARN_ON(req->bio != NULL);
>>>>
>>>> + blk_wb_done(q->rq_wb, req);
>>>> +
>>>> /*
>>>> * Request may not have originated from ll_rw_blk. if not,
>>>> * it didn't come out of our reserved rq pools
>>>> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>>> int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>>>> struct request *req;
>>>> unsigned int request_count = 0;
>>>> + bool wb_acct;
>>>>
>>>> /*
>>>> * low level driver can indicate that it wants pages above a
>>>> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>>> }
>>>>
>>>> get_rq:
>>>> + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
>>>> +
>>>> /*
>>>> * This sync check and mask will be re-done in init_request_from_bio(),
>>>> * but we need to set it earlier to expose the sync flag to the
>>>> @@ -1781,11 +1789,16 @@ get_rq:
>>>> */
>>>> req = get_request(q, rw_flags, bio, GFP_NOIO);
>>>> if (IS_ERR(req)) {
>>>> + if (wb_acct)
>>>> + __blk_wb_done(q->rq_wb);
>>>> bio->bi_error = PTR_ERR(req);
>>>> bio_endio(bio);
>>>> goto out_unlock;
>>>> }
>>>>
>>>> + if (wb_acct)
>>>> + req->cmd_flags |= REQ_BUF_INFLIGHT;
>>>> +
>>>> /*
>>>> * After dropping the lock and possibly sleeping here, our request
>>>> * may now be mergeable after it had proven unmergeable (above).
>>>> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>>>> blk_dequeue_request(req);
>>>>
>>>> req->issue_time = ktime_to_ns(ktime_get());
>>>> + blk_wb_issue(req->q->rq_wb, req);
>>>>
>>>> /*
>>>> * We are now handing the request to the hardware, initialize
>>>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>>>> blk_unprep_request(req);
>>>>
>>>> blk_account_io_done(req);
>>>> + blk_wb_done(req->q->rq_wb, req);
>>>
>>> Hi Jens,
>>>
>>> It seems the function blk_wb_done() will be executed twice even if the end_io
>>> callback is set.
>>> Maybe the same thing would happen in blk-mq.c.
>>
>> Yeah, that was a mistake, the current version has it fixed. It was inadvertently added when I discovered that the flush request didn't work properly. Now it just duplicates the call inside the check for whether it has an ->end_io() defined, since we don't use the normal path for that.
>>
> Hi Jens,
>
> I have checked the wb-buf-throttle branch in your block git repo; I am not sure it is the complete version.
> It seems the problem is only fixed in blk-mq.c. The function blk_wb_done() would still be executed twice in blk-core.c
> (in the functions blk_finish_request() and __blk_put_request()).
> Maybe we can add a flag to mark whether blk_wb_done() has already been done.
Good catch, looks like I only patched up the mq bits. It's still not
perfect, since we could potentially double account a request that has a
private end_io(), if it was allocated through the normal block rq
allocator. It'll skew the unrelated-io-timestamp a bit, but it's not a
big deal. The count for inflight will be consistent, which is the
important part.
We currently have just 1 bit to tell if the request is tracked or not,
so we don't know if it was tracked but already seen.
I'll fix up the blk-core part to be identical to the blk-mq fix.
--
Jens Axboe
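A hypothetical way to make the accounting robust against double calls, along
the lines Kaixu suggests, is to clear the tracking bit the first time the
request is accounted, so a second blk_wb_done() becomes a no-op. The wrapper
below only illustrates that idea; REQ_BUF_INFLIGHT is the flag from the
posted patches, but this helper is not from the series. It keeps the
inflight count consistent, though as Jens notes it still cannot tell a
tracked-and-accounted request apart from an untracked one afterwards.

static void blk_wb_done_once(struct request_queue *q, struct request *rq)
{
	/* Not throttled, or already accounted for on a previous call */
	if (!(rq->cmd_flags & REQ_BUF_INFLIGHT))
		return;

	rq->cmd_flags &= ~REQ_BUF_INFLIGHT;
	blk_wb_done(q->rq_wb, rq);
}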
Hi Jens,
I am testing current linux-block.git#wb-buf-throttle on top of Linux
v4.6-rc5 here on my Ubuntu/precise AMD64.
( Was installed as a WUBI "test" system - "testing" since April 2012 :-) .)
Here are some numbers:
# df -T | egrep 'sda|loop'
/dev/sda2 fuseblk 465546236 210981868 254564368 46% /host
/dev/loop0 ext4 17753424 15586612 1241936 93% /
# egrep 'sda|loop|ext4' /etc/fstab
/host/ubuntu/disks/root.disk / ext4
loop,errors=remount-ro 0 1
/host/ubuntu/disks/swap.disk none swap loop,sw
0 0
( Not sure why I cannot do a find on /sys/block/ .)
# find /sys/devices/ -name '*wb_stat*' | egrep 'loop0|sda'
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_stats
/sys/devices/virtual/block/loop0/queue/wb_stats
# find /sys/devices/ -name '*wb_lat*' | egrep 'loop0|sda'
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_lat_usec
/sys/devices/virtual/block/loop0/queue/wb_lat_usec
# cat /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_stats
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_lat_usec
background=8, normal=16, max=31, inflight=0, wait=0, bdp_wait=0
75000
# cat /sys/devices/virtual/block/loop0/queue/wb_stats
/sys/devices/virtual/block/loop0/queue/wb_lat_usec
background=16, normal=32, max=64, inflight=0, wait=0, bdp_wait=0
75000
Questions...
Planning a v5?
Will this go to Linux v4.7 or later?
How should someone test?
Documentation...
Can you add some more docs about getting this info (see the cat outputs
above) under the Documentation/ directory?
( Talking about the stuff you have embedded in the commit messages. )
Unfortunately, I have no data volume left (using a Mobile Broadband
Network @ 56kBps), so I cannot send you my dmesg, linux-config and
patchset.
Regards,
- Sedat -
On 04/26/2016 01:04 AM, Sedat Dilek wrote:
> Hi Jens,
>
> I am testing current linux-block.git#wb-buf-throttle on top of Linux
> v4.6-rc5 here on my Ubuntu/precise AMD64.
> ( Was installed as a WUBI "test" system - "testing" since April 2012 :-) .)
Great! Thanks for testing.
> Here are some numbers:
>
> # df -T | egrep 'sda|loop'
> /dev/sda2 fuseblk 465546236 210981868 254564368 46% /host
> /dev/loop0 ext4 17753424 15586612 1241936 93% /
>
> # egrep 'sda|loop|ext4' /etc/fstab
> /host/ubuntu/disks/root.disk / ext4
> loop,errors=remount-ro 0 1
> /host/ubuntu/disks/swap.disk none swap loop,sw
> 0 0
What kind of device is sda?
> ( Not sure why I cannot do a find on /sys/block/ .)
Probably the symlinks that confuse it.
> # find /sys/devices/ -name '*wb_stat*' | egrep 'loop0|sda'
> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_stats
> /sys/devices/virtual/block/loop0/queue/wb_stats
>
> # find /sys/devices/ -name '*wb_lat*' | egrep 'loop0|sda'
> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_lat_usec
> /sys/devices/virtual/block/loop0/queue/wb_lat_usec
>
> # cat /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_stats
> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/wb_lat_usec
> background=8, normal=16, max=31, inflight=0, wait=0, bdp_wait=0
> 75000
>
> # cat /sys/devices/virtual/block/loop0/queue/wb_stats
> /sys/devices/virtual/block/loop0/queue/wb_lat_usec
> background=16, normal=32, max=64, inflight=0, wait=0, bdp_wait=0
> 75000
>
> Questions...
>
> Planning a v5?
Yes, I'll post a v5 today or something like that. Functionally not a
huge amount of changes, but it does have a few important bug fixes that
make it perform better. The biggest part is making it generic, so it can
be plugged into NFS as well, for instance.
> Will this go to Linux v4.7 or later?
> How should someone test?
Probably a bit too tight for 4.7, but one can always hope. 4.8 is
probably a more realistic target.
Testing really means running some readers while you have writes going on.
One example is doing something that reads while you have a dd writing to
your device. Or checking interactive feel while installing a lot of packages,
which tends to generate a ton of writes as well.
If you are so inclined, I'd encourage you to test the v5 I'll post later
today. I've run it through its paces on various devices.
> Documentation...
>
> Can you add some more docs about getting infos (see above cat#s etc.)
> below Documentation/ directory?
> ( Talking about the stuff you have embedded in the commit-messages. )
Yeah, I'll do that, it is a bit light right now. It'll be in the next
release.
--
Jens Axboe